Hello everyone, and welcome to the KubeCon talk on scaling Kubeflow for multi-tenancy at Spotify. Today, Jonathan and I are going to share the story of how we scaled up our Kubeflow platform to serve the growing number of ML teams at Spotify. Let's start with the introduction. My name is Keshi Dai, and I'm a senior ML infrastructure engineer at Spotify. This is my teammate, Jonathan Jin, also a senior engineer. You will hear from him during the later part of this talk.

So we are Spotify. If you haven't heard of us before, we are an audio streaming service. We launched in 2008, and now we have more than 365 million monthly active users. On our platform, we have more than 70 million tracks and almost 3 million podcast titles, and our service is available in 178 markets all over the world. Machine learning is at the heart of almost everything we do at Spotify, including recommending personalized content on the homepage, optimizing the ranking of search results, and helping you discover and explore music you haven't listened to before. It enables us to recommend artists, playlists, and podcasts to keep our users active, engaged, and more likely to subscribe in the long term.

To power ML products at Spotify, our team is building a standardized machine learning platform to provide our engineers with the tools and environment to quickly prototype, experiment, and productionize their ML ideas. Our platform, internally known as the Kubeflow Platform, consists of two major components: a Python SDK for building ML workflows with TFX components, and several managed Kubeflow GKE clusters for ML pipeline executions. For those who don't know, TFX (TensorFlow Extended) is a component-based ML framework around the TensorFlow ecosystem, and Kubeflow is a set of machine learning toolkits on top of Kubernetes. On our platform, we mainly use Kubeflow Pipelines to orchestrate ML workflows built with TFX.

This is a typical ML workflow at Spotify. As you can see, it has a sequence of components representing different steps in a machine learning pipeline. It starts with feature engineering: feature collector and transform components assemble the raw features and transform them into meaningful ones for model training. Next, we have stats and schema generation components to validate those features and produce a schema file based on the feature data stats. Then we have the trainer component for model training and the evaluator component for model performance analysis. In the end, if everything looks good, we can deploy the model to production through the deployer. But of course, in reality, an actual ML pipeline can be much more complicated than this example. It also includes user-defined custom components containing different business logic.
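As a quick aside, here is a minimal sketch of the pipeline shape just described, written against the open-source TFX SDK rather than our internal one; the data path, module file, and pipeline names are hypothetical.

```python
# A minimal sketch of the pipeline shape described above, using the open-source
# TFX SDK (not Spotify's internal SDK). Paths and names are hypothetical.
from tfx import v1 as tfx

def create_pipeline(data_root: str, module_file: str) -> tfx.dsl.Pipeline:
    # Ingest raw examples (stands in for our feature collector/transform steps).
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)
    # Compute feature statistics and infer a schema from them.
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"])
    # Train the model, then analyze its performance.
    trainer = tfx.components.Trainer(
        module_file=module_file,
        examples=example_gen.outputs["examples"],
        schema=schema_gen.outputs["schema"],
        train_args=tfx.proto.TrainArgs(num_steps=1000),
        eval_args=tfx.proto.EvalArgs(num_steps=100))
    evaluator = tfx.components.Evaluator(
        examples=example_gen.outputs["examples"],
        model=trainer.outputs["model"])
    # Deploy the blessed model (the "deployer" step in the talk).
    pusher = tfx.components.Pusher(
        model=trainer.outputs["model"],
        model_blessing=evaluator.outputs["blessing"],
        push_destination=tfx.proto.PushDestination(
            filesystem=tfx.proto.PushDestination.Filesystem(
                base_directory="/serving/model")))
    return tfx.dsl.Pipeline(
        pipeline_name="example-pipeline",
        pipeline_root="/tmp/pipeline-root",
        components=[example_gen, statistics_gen, schema_gen,
                    trainer, evaluator, pusher])
```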
After a pipeline is authored, teams can submit it to our GKE cluster, where it gets executed through Kubeflow Pipelines. On our clusters, we installed Istio for service discovery and authentication, the ML Metadata store to track TFX component executions, and our own tooling for team management and metric monitoring. This Kubeflow platform is extremely valuable, since it helps us better manage our ML workloads and accelerates the pace of model experimentation and rollout. Now that we understand a little bit more about how ML is done at Spotify, let's focus on the Kubernetes part: our Kubeflow clusters. For those who attended KubeCon North America 2019, you might remember we shared our story on how we built and managed the Kubeflow clusters when we first launched the platform. Let's do a quick recap here.

Before I start diving into the details, I want to remind everyone that everything we do at Spotify is on Google Cloud. Because of this, we are able to make a lot of assumptions and utilize many Google Cloud services when we build our platform. All Kubeflow clusters are built through the same process: the GCP resources are created and managed by Terraform, including GKE clusters, Cloud SQL instances, service accounts, and more. Terraform is a tool for building, changing, and versioning infrastructure, so the infrastructure can be treated as code. On the other hand, Kubeflow-related Kubernetes resources are organized with Kustomize and deployed through kubectl. Kustomize decomposes resource files into base and overlay files, which allows us to create a customization layer on top of the open-source solution.

We first launched our platform in 2019, and we had a beta release in 2020. We will go GA early next year. So far, we have more than 60 teams and almost 600 users on our platform. There were 30,000 models trained and 100,000 pipeline execution hours run on our platform last year. On average, there are close to 300 pipeline runs every day. It has been a great journey for us to witness our platform's growth and be part of it.

In the following sections, Jonathan and I are going to talk about the challenges we faced while scaling up our platform and how we addressed them. More specifically, we are going to cover, first, how we support a growing number of ML teams on our platform and allow them to operate in an isolated and self-manageable environment. Second, how we deal with upstream breaking changes by using a multi-cluster strategy. Then we are also going to cover our GitOps framework for continuous deployment, to manage multiple clusters and complex deployments. In the end, we will talk about cluster observability for targeted and uniform reliability across multiple clusters.

Let's start with team-based multi-tenancy support. When we started working with Kubeflow, there was no multi-tenancy support. Everything was running in a single namespace, including Kubeflow services and actual user pipelines. In early 2020, multi-tenancy was introduced in Kubeflow Pipelines, and we were one of the first teams that worked with Google to adopt it into our own platform. It's based on the open-source version of the Kubeflow profile component. Each profile corresponds to a namespace on a cluster, and you can configure the owner, the contributors, and a service account used for the namespace. Although it provided basic multi-tenancy features, it came with several drawbacks that didn't work well with our internal team structure. First, a Kubeflow profile's owner has to be a real user, not a team represented by a group email. This is not ideal because people move between teams; if the owner leaves the team, the namespace needs to be reconstructed. Second, the contributors of a namespace have to be managed manually, and there is no standard process to do that. If we build a new cluster, the owner of a namespace has to configure all the contributors again. Third, it doesn't support Google groups, and all contributors need to be added individually, which is tedious, especially for a big team. At Spotify, each ML team operates in its own GCP project, and members of the team share the same access to the data and resources in the project.
We need team-based multi-tenancy support in our Kubeflow clusters, so that everyone on the team has access as soon as the namespace is created. Meanwhile, with this setup, we can also conveniently obtain insights into team-level operational metrics and resource consumption to better understand and manage our platform. With this in mind, we developed our own team management tooling for Kubeflow clusters. In our setup, each team profile and namespace is owned by a Google group that corresponds to an internal LDAP group, which removes the need for an individual user to be the owner. To support the team concept in the Kubeflow profile, we created a custom resource definition to define RBAC and Istio rules for team members and additional contributors. We also created a corresponding Kubernetes controller to support Google groups: it automatically extracts members from the group and configures the necessary permissions for them. The controller also periodically syncs with the Google Groups service, so any changes in the team are automatically reflected on the cluster. Lastly, we use Kustomize to manage our team configs and deploy them through GitOps. An engineer on a team can submit a pull request to the repo where all the profiles are stored; the CI runs Kustomize and verifies the changes. Once the PR is merged, the master build invokes Kustomize to render the resources and deploy them to the clusters. This process also allows us to easily reproduce and port the same team namespace setup to a different cluster: we simply need to add it to the list of clusters defined in the CI/CD.

Let's take a look at an actual team profile example. The name field defines the desired name for the team profile and the namespace. The group email field should be the group email for the team that owns the profile. The profile also requires a GCP service account to access datasets and GCP resources from its namespace; it's bound to a Kubernetes service account through GKE Workload Identity. The contributors field defines a list of additional users that the team would like to give access to; they can be individuals or groups. Lastly, the quota override field allows the team to override the default resource quota defined by the platform team, subject to the administrator's approval.

After a team configuration is defined, our tool automatically converts it into a set of Kubernetes resource files. As you can see on the right side, it has a Kubeflow profile YAML and an example namespace folder. In that namespace folder, there are CRD resource files that define the RBAC and Istio rules for team members and contributors; our controller picks them up and sets up permissions accordingly after they're deployed. It also has a limit range file that defines the default resource requests and limits for containers running in that namespace. Meanwhile, the kustomization file assembles the different resource manifests needed for that namespace so they can be deployed all together. This flexible structure also allows users to add more custom resources to the team namespace if they need them in the future.
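Circling back to the controller for a moment: to make its job concrete, here is a heavily simplified sketch of what one group-syncing reconcile pass could look like. This is illustrative, not our actual controller; it assumes the Google Admin SDK Directory API for group membership and the aggregated kubeflow-edit ClusterRole that open-source Kubeflow ships with.

```python
# Heavily simplified sketch of a group-syncing reconcile pass (illustrative,
# not our actual controller). Assumes Google Admin SDK Directory API access
# and the "kubeflow-edit" ClusterRole from open-source Kubeflow.
from googleapiclient.discovery import build
from kubernetes import client, config

def list_group_members(directory, group_email):
    """Fetch member emails for a Google group via the Directory API."""
    resp = directory.members().list(groupKey=group_email).execute()
    return [m["email"] for m in resp.get("members", [])]

def bind_member(rbac, namespace, email):
    """Grant one team member edit access in the team namespace."""
    body = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"team-edit-{email.split('@')[0]}",
                     "namespace": namespace},
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "ClusterRole",
                    "name": "kubeflow-edit"},
        "subjects": [{"apiGroup": "rbac.authorization.k8s.io",
                      "kind": "User",
                      "name": email}],
    }
    rbac.create_namespaced_role_binding(namespace, body)

def reconcile(namespace, group_email):
    """One sync pass: mirror current group membership into namespace RBAC.

    A real controller would also remove bindings for departed members and
    run on a schedule or in response to watch events.
    """
    config.load_incluster_config()  # assumes the controller runs on-cluster
    rbac = client.RbacAuthorizationV1Api()
    directory = build("admin", "directory_v1")  # credentials omitted for brevity
    for email in list_group_members(directory, group_email):
        bind_member(rbac, namespace, email)
```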
Another topic I would like to talk about is our Kubeflow multi-cluster strategy, since the landscape of ML is evolving so fast and we are constantly dealing with infrastructure upgrades as well as the breaking changes that come with them. Our initial platform setup consisted of three clusters. We have an experimental cluster for internal infrastructure prototyping and testing. A prod cluster is for production pipelines as well as ad hoc ML workloads. We also have a dev cluster intended for platform developers, but it also serves as a backup cluster when prod is under upgrade or maintenance.

This setup made rolling out a new version of Kubeflow painful and slow. We usually first tested it internally on our experimental cluster, then applied the changes to dev, and then waited for our users to upgrade their pipelines before we could eventually roll it out to prod. The process usually took weeks to complete. To support a rollout, sometimes we needed to handle breaking changes in our client SDK or install multiple versions of Kubeflow services on a cluster for backward compatibility. On the other hand, during the final production rollout, we were under pressure to complete the upgrade ASAP to minimize user interruption. Even worse, this approach forced users to upgrade their ML pipelines whenever we upgraded the infrastructure, so teams couldn't migrate on their own schedules.

As a result, we implemented a multi-cluster strategy for our Kubeflow platform. Since our platform is made up of a Python SDK and a managed cluster, we decided to version our infrastructure along with the SDK. A pipeline is guaranteed to run on a cluster with the same major version as the SDK that built it, and each cluster installs a set of services with versions compatible with each other as well as with the client SDK. As shown in the illustration, suppose we are committed to supporting three major versions; then we have three dedicated clusters, and a pipeline is always submitted to the cluster with the same major version. In addition, we have an extra on-demand cluster as a backup, configured through Terraform so it can be quickly booted up if one of our clusters goes down.

So why is this better? First, it encapsulates the entire Kubeflow stack, from the pipeline SDK to the execution engine, to ensure compatibility. Second, it decouples the infrastructure upgrade cadence from the pipeline upgrade cadence: users can upgrade their pipelines based on their own priorities and schedules. It also clearly defines the platform's support scope: users have to upgrade their pipelines before we deprecate an old cluster, so we can communicate clearly with our users. Last but not least, we no longer need to worry about breaking changes and backward compatibility; they can simply be addressed by offering users a new cluster.

But what does that mean for our team? This multi-cluster strategy definitely required a more sophisticated setup for our cluster management. So we consolidated our Terraform process to encapsulate the entire cluster creation logic in one module, so that different clusters can be easily cookie-cuttered. We also developed our Kubeflow deployment blueprints, taking advantage of Kustomize and a GitOps framework to manage the Kubernetes resources. Meanwhile, we centralized ML workflow metadata in our own metadata service, so people can still compare the performance of models produced on different clusters. Similarly, we are also going to implement an aggregated experiments page so folks can still view all their pipeline runs in one place. Finally, this is an illustration of our multi-cluster-based Kubeflow platform: each Kubeflow cluster is built through our Kubeflow deployment blueprints, and on top of that, we have a centralized ML workflow metadata service and a unified UI for pipeline experiments and runs.
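To make the version routing idea concrete, here is a hypothetical sketch of how a client SDK could pick the cluster matching its own major version; the host names and the version constant are made up, not our actual endpoints.

```python
# Hypothetical sketch of SDK-major-version-to-cluster routing; the host names
# and version constant are made up, not Spotify's actual endpoints.
import kfp

SDK_VERSION = "2.3.1"

CLUSTER_HOSTS = {
    "1": "https://kubeflow-v1.example.internal",
    "2": "https://kubeflow-v2.example.internal",
    "3": "https://kubeflow-v3.example.internal",
}

def client_for_sdk_version(version: str = SDK_VERSION) -> kfp.Client:
    """Return a Kubeflow Pipelines client pointed at the cluster whose
    major version matches the SDK that built the pipeline."""
    major = version.split(".")[0]
    return kfp.Client(host=CLUSTER_HOSTS[major])

# e.g. client_for_sdk_version().create_run_from_pipeline_package(...)
```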
Next, I'll hand the talk over to my teammate Jonathan, who will talk in more detail about how we improved the Kubeflow deployment process for our new multi-cluster setup.

Hey everybody, I'm Jonathan Jin. As Keshi just said, I'm going to talk about the work that we've done around deployment, monitoring, and metrics in support of the multi-cluster strategy that you all just heard about. More broadly, however, we'll be focusing on the derivative infrastructural challenges resulting from that increased complexity and what we have done to address them.

To start, like Keshi said, we've increased our investment in Terraform pretty significantly in the lead-up to our multi-cluster offering, going all in on our infrastructure-as-code strategy. This has enabled us to tackle several new challenges that have cropped up as a result of multi-cluster. For example, now that we have all these user-facing clusters, how do we ensure consistency between them? And in places where they do need to diverge from the standard configuration in meaningful ways, how do we manage those deviations sustainably and systematically? At the same time, having all these clusters inherently introduces new overhead and manual toil around the pretty fundamental task of applying changes to them. Lastly, there is the ongoing challenge of managing the implicit dependencies between Kubernetes resources and Kubeflow resources when deploying to those clusters, as well as any bootstrapping processes that need to be run systematically on each new cluster to get it ready for incoming user traffic, and so on.

In response to all of these new and amplified challenges, we decided to adopt Argo CD as the framework around which to base our cluster deployment operations moving forward. This is a natural extension of our existing GitOps setup. Now, instead of applying all of our changes by hand to each user-facing cluster that we oversee, and maintaining a loose mental model of which dependencies exist between which resources and what follow-up steps need to be taken to bootstrap, we can encode all of that very formally into our Argo CD setup. This also brings with it several nice paradigms, including deployment parameterization, to really drive home the idea that some clusters will inherently need to differ from their peers in different ways.

Now, if we return to the visualization Keshi presented earlier of this fleet of user-facing clusters: rather than manage all of this by hand in a very manual and toilsome process, we can bring Argo CD into the fray as our deployment broker, in a sense. We as cluster maintainers make our requisite changes to the resource manifests in our deployment blueprints repository as before. These changes in turn get picked up by Argo CD on our behalf and propagated accordingly to all of these clusters. This streamlines the deployment process pretty significantly, and it allows our team to focus on higher-level requirements while offloading the rote mechanical work of propagating changes to Argo CD. More critically, in doing so, we ensure greater reliability for users of this product and less chance of breakages or inconsistencies.

So with that, let's shift focus a bit and talk a little about observability in this new multi-cluster world. As adoption grows for our Kubeflow platform, with all these new clusters and new users, the potential cost of any outage or inefficiency grows accordingly.
The key insight that our team came to here is that users are not alerts, and they should not act as an alerting system. What I mean by this is that our users should not be the first to know about any issues in our clusters, and by extension, we as cluster owners and maintainers should know about those issues before anybody else. In other words, if a user has been impacted by an outage, an inefficiency, or a regression, it's basically too late. And with multi-cluster, we now have more complex observability needs. Concretely, we need to ensure parity in instrumentation, not just in existing clusters, but in any new ones that we might spin up in the future. They all need the same guarantees, the same instrumentation, the same alerting, and so on.

To that effect, we took further advantage of the infrastructure-as-code paradigm with our Terraform setup. More fundamentally, however, we expanded our notion of infrastructure and what it really entails. At first glance, one might consider infrastructure to be solely the compute resources and the networking, for example, needed to run your cluster and its constituent services. But we argue that, really, it's so much more than that. It includes not just the cluster resources themselves, but also all the configurations and auxiliary tooling needed to work effectively with those clusters. These might include alerts, SLOs that you might want to track, dashboards for real-time outage triaging and on-call operations, and any auxiliary deployments or sidecars that you might need to supplement the core parts of your product offering.

So with that, going back to our Terraform setup for a second: on top of using Terraform to configure and provision compute resources, node pools, database instances, and similar, we'll call it, concrete infrastructural requirements, we can extend that configurability to standardized dashboards, alert policies, and formalized SLOs, all tracked within GCP and all configured using the official GCP Terraform provider. That's not all, however. The paradigm becomes all the more powerful once you take fuller advantage of Terraform modules to encapsulate all these constructs in a parameterized, logical representation of a single cluster. With that, we can now stamp out new clusters, much like creating new instances of a class or a struct in your typical programming language. But now they also come with batteries included, essentially, and all the requisite tooling on top of the raw compute comes for free as well. This entire pattern allows us to spin up and tear down entire Kubeflow clusters with ease, and all with ironclad guarantees that every cluster is set up exactly the way we expect it to be.

Speaking of observability, let's talk a little more specifically about SLO tracking for our Kubeflow platform. As the platform matures, we have an increased need for visibility with regard to performance, stability, and reliability. This applies not just to us as cluster owners, but to our users as well. They want to know, and really they deserve to know, what kind of performance they can expect from our clusters if they're going to adopt them. Like I touched on earlier, we take heavy advantage of GCP's native tooling for SLO tracking via Terraform, and it gives us a lot of nice bells and whistles. For example, we can track what's called the error budget and the burn rate, which give us ready insight into how close we are at any given time to violating our SLO. We can also define alerts on top of the error budget and the burn rate, such that if we're starting to run low on error budget, we can take appropriate remediations to stay within budget for that time frame and prevent ourselves in advance from violating our own SLOs.
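For intuition, here is a rough sketch of the arithmetic behind those two numbers; this is illustrative only, since GCP's SLO tooling computes them for you from your SLO definition, and real burn-rate alerting typically evaluates them over rolling windows.

```python
# Rough sketch of error-budget and burn-rate arithmetic (illustrative only;
# GCP's SLO tooling computes these from your SLO definition).

def burn_rate(slo_target: float, good: int, total: int) -> float:
    """How fast we're consuming error budget; 1.0 means exactly on pace."""
    budget = 1.0 - slo_target              # allowed failure fraction
    failed = (total - good) / total        # observed failure fraction
    return failed / budget

def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left; 1.0 is untouched, <= 0.0 is violated."""
    return 1.0 - burn_rate(slo_target, good, total)

# Example: a 99.9% availability SLO with 500 failures in 1,000,000 requests
# gives burn_rate == 0.5 -- we're consuming budget at half the allowed pace.
```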
On that note, let's talk about the foundation of all of these things: metrics. Our product has the luxury of a very extensive buffet of out-of-the-box metrics to choose from, provided by our respective dependencies, including Istio, Kubeflow, and Kubernetes itself via kube-state-metrics. However, we noticed pretty early on that a lot of these metrics, in trying to be as general as possible and avoid domain specificity, often don't outright enable us to track what specifically matters to us and what is critical to our product offering. And even in the cases where it was in fact possible to do so, it oftentimes required really lengthy, really arcane, really verbose recording rules and extensive PromQL arithmetic, to the point where maintainability became an issue.

To that effect, we implemented what we call, very appropriately, Kubeflow State Metrics. We took heavy inspiration from kube-state-metrics to create an analogous custom metrics exporter specifically targeting Kubeflow use cases. Just like kube-state-metrics, when deployed to our clusters, Kubeflow State Metrics listens for Kubernetes events, such as pod creation or scheduling changes, and translates them into appropriate product-specific Prometheus metrics. These might include tracking how long Kubeflow pipeline pods take to start running, how long those pods stay on the cluster once execution concludes, and so on. We can then use those metrics downstream in SLOs, alerts, dashboards, and what have you. And since Kubeflow itself is based on Kubernetes, this unlocks a very powerful usage pattern: we can effectively instrument and track Kubeflow behavior via the native Kubernetes API without ever needing to modify, patch, or fork Kubeflow code itself.
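To show the shape of this pattern, here is a minimal sketch of a kube-state-metrics-style exporter for one such metric, pod start latency. The metric name and label selector are assumptions, and our actual exporter is considerably more involved.

```python
# Minimal sketch of a kube-state-metrics-style exporter for pipeline pods
# (illustrative; the metric name and label selector are assumptions, and a
# real exporter would handle restarts, cleanup, and more metrics).
import time
from kubernetes import client, config, watch
from prometheus_client import Histogram, start_http_server

POD_START_LATENCY = Histogram(
    "kubeflow_pipeline_pod_start_latency_seconds",  # hypothetical metric name
    "Time from pod creation to the Running phase for pipeline pods",
)

def main():
    config.load_incluster_config()  # assumes the exporter runs on-cluster
    v1 = client.CoreV1Api()
    start_http_server(8080)         # expose /metrics for Prometheus to scrape
    seen = set()                    # record each pod's start latency only once
    # KFP pipeline pods are Argo workflow pods; this selector is an assumption.
    for event in watch.Watch().stream(
            v1.list_pod_for_all_namespaces,
            label_selector="workflows.argoproj.io/workflow"):
        pod = event["object"]
        uid = pod.metadata.uid
        if pod.status.phase == "Running" and uid not in seen:
            seen.add(uid)
            created = pod.metadata.creation_timestamp  # tz-aware datetime
            POD_START_LATENCY.observe(time.time() - created.timestamp())

if __name__ == "__main__":
    main()
```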
With all of that, I want to talk a little now about some of the key lessons that our team has learned throughout this work. The first lesson, and this is one that I touched on before, is that your infrastructure is more than just compute. In particular, it is, in our opinion, worth opening yourself up to thinking about your cluster infrastructure in more abstract terms. Getting into the habit of managing the less concrete aspects of your cluster operations, like dashboards and alerts, with the same amount of rigor will open the door to surprising new opportunities for standardization and formalization.

Secondly, I am a very strong proponent of investing in observability and reliability preemptively, before you really, truly, desperately need it. At the very least, it's worth having a plan in place. At the end of the day, users should be afforded the luxury of spending as little time as possible, ideally none, thinking about the infrastructure underlying their work. Investing in observability and reliability in advance empowers teams like ours to let users focus on the problems that truly matter to them, freeing them from concerns about flakiness and inconsistency, or from wondering whether something is their fault or a bug in the Kubeflow platform. And we delay those investments at our own risk: they can be really difficult and time-consuming to put in place reactively, after the fact, and all the while users are being frustrated by outages and confusing behavior, and continuing to lose confidence in your platform over time.

And lastly, I want to encourage expanding your notion of what your product truly is. Taking the Spotify Kubeflow platform as an example, we argue that our platform, while based primarily around Kubeflow, is not exclusively Kubeflow itself. Really, what we've done is treat Kubeflow as an extensible foundation that we have molded, customized, and extended, and continue to do so, to fit our ever-growing platform needs. This includes all the new functionality that we described, such as multi-tenancy, multi-cluster support, reliability instrumentation, and custom metrics. Opening yourself up to the idea of extending your foundation rather than being strictly tethered to it, and really taking full advantage of Kubernetes as the common abstraction layer, opens the door to some very powerful and very compelling usage patterns.

In closing, Keshi and I would like to give you all a preview of what's in store for the Kubeflow platform here at Spotify. First off, we plan to increase our investment in observability. We're thinking of this in terms of both reliability engineering and user-facing functionality. In the former case, we're looking at truly formalizing metrics as part of our platform, possibly taking advantage of the Prometheus Operator to treat recording rules and Prometheus alerts as managed resources within our Kubernetes clusters. We'll also be looking at ways to expose user pipeline metrics to users themselves, for integration into their own alerting and monitoring setups.

We intend as well to increase our investment in what I like to call on-cluster compute. Currently, the bulk of users' Kubeflow computations end up getting outsourced to higher-level managed GCP services, such as Dataflow for data engineering and Cloud AI Platform for model training. However, this execution model does not always mesh well with users' needs. Maybe they're concerned about the cloud costs, or their training needs are esoteric enough that they would like fuller control over the execution model of their pipelines. We'll be looking at solutions for users like those, to empower them with the flexibility to take advantage of training patterns that don't cleanly fit into that standard user journey.

We also want to invest more in users' ability to manage the particulars of their Kubeflow journey themselves. This might include management of service accounts, permissions, retention policies, and cluster quota overrides that allow them to temporarily consume more resources for intensive compute. Doing so would give advanced users greater control over their own platform usage, while still allowing newcomers to get started quickly with the sensible defaults that we provide for them.

And lastly, we believe very strongly in contributing back to the open-source community that has empowered so much of our own work. Case in point: a lot of our work, such as the Kubeflow-centric metrics exporter that I mentioned before, has been designed from day one for generality and open-sourceability. As such, we're actively looking at ways to open source some of our work around running Kubeflow in production.
Details are kind of fuzzy right now, but there will hopefully be more to come, and we'd love to share more with you in the near future. And that's all we had to talk to you about today. Thanks a bunch for listening to us. If any of this work sounds interesting to you, we welcome you to check out lifeatspotify.com for new opportunities around Kubernetes, cloud-native computing, and machine learning at Spotify. Thanks again for your time. We'll be taking questions now.