Hello everyone, my name is Adrian, and together with Victor we are going to present our five-year journey of how we leverage namespaces in a multi-tenant fashion to scale the adoption of Kubernetes at Adobe, and how we built a foundation for Adobe's developer platform. A few words about me: I'm currently a lead cloud engineer at Adobe, part of the Ethos team, the team that powers the Kubernetes platform at Adobe. I'm also a member of the Kubernetes GitHub organization, and currently I'm focusing on contributing as much as I can to the Cluster API ecosystem. When I'm not breaking clusters, I like riding big bikes, as you can see on the slide. Victor, do you want to introduce yourself and kick off the presentation?

Yeah, sure, thank you Adrian. Hello everyone, my name is Victor Varza. I'm a technical lead at Adobe, and I'm passionate about open source contributions. I'm one of the organizers of Kubernetes Community Days, or KCD, in Romania, which will be the first KCD event in Southeast Europe and will be held next year in April. Together with Adrian, we are the authors of Adobe's k8s-shredder and Cluster Registry, two open source projects that we successfully integrated into our platform and that we are going to talk about today. In the first part of the presentation I'm going to talk about Project Ethos, Kubernetes namespaces, and capacity management, and Adrian will continue with governance policies, multi-tenancy at scale, and non-disruptive Kubernetes upgrades, plus a live demo, so stay tuned.

Before we dive in, I would like to share a nice quote by Marty Cagan, which I found in his recent book titled Inspired: "It doesn't matter how good your engineering team is if they are not given something worthwhile to build." In other words, it matters what your engineering teams are building, but also who they are building it for. At Adobe, our Kubernetes platform, called Ethos, is used by amazing internal engineering teams working on Adobe products such as Adobe Photoshop, Adobe Analytics, Adobe Firefly, Adobe Experience Manager, Adobe Sign, and so on.

Project Ethos is a cross-cloud, multi-tenant, Kubernetes-based platform built through collaboration between Adobe's infrastructure teams and product development teams. The initial version of Ethos has its roots in 2015; it was built with Docker and DC/OS and was first in production in 2016. It was a good decision at that moment to start with DC/OS, because we gained experience with containers, microservice architectures, and multi-tenancy before Kubernetes became mature. We also built the necessary abstractions so a developer can seamlessly deploy their application to production. This abstraction is called Ethos CaaS, or Containers as a Service. In 2018 we started the development of the next-generation runtime platform, based on Kubernetes. We identified an opportunity within Kubernetes namespaces and added them as a new option for our developers. This offering is called Ethos PaaS, which stands for Platform as a Service. With Ethos PaaS, developers take ownership of a Kubernetes namespace and can deploy their application inside it, using their preferred CI/CD tool. This approach gives flexibility to developers, and it is particularly valuable when your application serves as a core CI/CD tool and can deploy other applications in Kubernetes. It is also valuable when your company is involved in acquisitions of other companies.
Migrating the acquired company's applications to your platform then becomes a straightforward process using Kubernetes namespaces. In 2019 we started the full migration of the legacy CaaS users from DC/OS to Kubernetes, and in 2022, based on the experience we gained with Ethos CaaS and Ethos PaaS, we introduced a new flavor, Ethos Flex. Ethos Flex runs on top of Ethos PaaS and is based on GitOps and Argo, so it provides a paved path to deploy your application to production, but also the flexibility of Kubernetes namespaces. Another big milestone was this year, when we adopted Cluster API and Argo on the infrastructure side, for building and managing Kubernetes clusters.

This is the Ethos Kubernetes platform from 10,000 feet, and how it fits within Adobe. At the top of the slide we have the three main Adobe clouds: Creative Cloud, Experience Cloud, and Document Cloud. These clouds are powered by Adobe software products such as Adobe Photoshop, Adobe Firefly, Adobe Analytics, Adobe Experience Manager, Adobe Sign, and so on. Together with the platforms they use, such as Sensei machine learning, Content Platform, and Experience Platform, all of these run on top of Ethos, and Ethos is basically the Adobe runtime for containerized applications. Ethos operates on three main cloud providers: Adobe Private Cloud, AWS, and Azure.

To better understand the platform's scale, let's look at some pretty impressive numbers, which are growing every month. Ethos hosts more than 2 million containers, encapsulated in 1 million pods, and these pods run in 41,010 namespaces, namespaces owned by the application development teams. We manage more than 300 clusters deployed across 28 different cloud regions in AWS, Azure, and Adobe Private Cloud. In terms of computing power, these workloads use around 35,000 compute nodes, consuming approximately 2.9 petabytes of RAM and 800,000 virtual CPUs. The AI applications, which are more and more present on our platform, utilize almost 8,000 GPUs.

Let's talk about multi-tenancy in Kubernetes. How many of you have heard about multi-tenant architectures? And how many of you are using Kubernetes in a multi-tenant architecture? Okay, we have a pretty good number; I can count them. There are many definitions for multi-tenancy. At Adobe, we use a multi-tenant architecture as a way to share physical clusters among multiple teams from different organizations and different projects. We have two types of clusters, shared clusters and dedicated clusters, also known as multi-tenant clusters and single-tenant clusters. Shared clusters are available to any internal engineering team in Adobe and are highly valuable for optimizing cost and enhancing the overall platform reliability. Dedicated clusters, on the other hand, are used for two main purposes. The first is when high security isolation is required, such as for applications that run software written by others. For instance, Adobe Experience Manager, which is a content management system solution, can run software written by Adobe customers. The other scenario is when a specific team has resource demands for their application that require an entire cluster's resources. An example of this is Adobe Firefly, a generative AI content creation solution, which places high demand on the available CPUs and GPUs in a cluster.

In order to implement multi-tenancy, we rely on Kubernetes namespaces. And developers love namespaces, because they provide flexibility, more control, and easier troubleshooting of their applications. In Ethos, namespace names are unique across the entire fleet, and we deploy a namespace profile template on the clusters. A namespace profile template is made up of a few Kubernetes objects that provide a minimum level of isolation within a cluster; a minimal sketch of such a profile follows below. First of all, we need a Kubernetes Namespace object to group the objects of a single team within the Kubernetes API. To ensure that only a specific team has access to a particular namespace, we use role bindings that link to the default Kubernetes cluster roles: admin, edit, and view. Quotas and limit ranges play a crucial role in limiting and controlling resource consumption, ensuring fair distribution of the resources inside a cluster. For network isolation, we use both Kubernetes native network policies and Cilium network policies; Cilium network policies are useful for implementing DNS-based policies and other layer 7 policies.
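As a rough illustration of the namespace profile just described, here is a minimal sketch, assuming a hypothetical tenant namespace team-a and a hypothetical identity group team-a-admins; the quota piece is sketched separately below under baseline quota units, and the actual Ethos profile certainly contains more than this.

```yaml
# Minimal namespace profile sketch (hypothetical names throughout).
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
# Scope the built-in "admin" cluster role to the tenant's namespace;
# similar bindings would exist for the "edit" and "view" roles.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-admin
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: team-a-admins      # hypothetical IdP group
---
# Default network isolation: pods in the namespace can talk to each
# other, while ingress from every other namespace is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-isolation
  namespace: team-a
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}
```

A CiliumNetworkPolicy would be layered on top of this for the DNS-based and other layer 7 rules mentioned above.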
After the namespace profile is deployed on a cluster, the tenant can deploy their application inside it. The tenant's application objects will be accessible only to that specific team, and the pods will be isolated by the default network policies.

In a multi-tenant environment, capacity management is a key consideration, because capacity issues may result in higher costs. By the way, who doesn't have cost concerns today when running applications in the cloud? We take action at three levels. At the pod level, in addition to horizontal pod autoscaling and vertical pod autoscaling, we use a solution named automatic resource configuration. At the namespace level, we simplify quota management using the concept of a baseline quota unit. And at the cluster level, we added capacity alerts.

Let's go through automatic resource configuration. We know that in Kubernetes, pods are scheduled on worker nodes based on their containers' resource requests, and they can burst up to the specified limits. So if the resource requests are lower, smaller allocations are reserved for that pod. This allows more pods to be scheduled on a node, which results in cost savings. To achieve this, we rely on Prometheus metrics to get the historical utilization data for a deployment's pods, and then an OPA policy is applied to adjust the CPU and memory requests of those specific pods to the right size.

At the namespace level, in order to simplify quota management operations, we introduced the concept of a baseline quota unit, or BQ. A BQ is actually a quota definition, and every namespace quota increase is achieved by multiplying each of the BQ items. For example, we have a BQ definition here: if we want to allocate, say, 32 virtual CPUs for our namespace instead of the 16 CPUs we have right now, we simply increase the namespace quota from one to two BQs, and the other BQ items are multiplied as well, so we also have 60 pods available to run in our namespace. This approach simplifies operations for both the tenant owners of the namespace and the cluster administrators. Two quick sketches of the BQ and the automatic resource configuration ideas follow below.
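This is a rough sketch of what one BQ could look like when rendered as a ResourceQuota. The 16-vCPU and 30-pod figures are derived from the example in the talk (two BQs give 32 vCPUs and 60 pods); the memory item and all names are assumptions.

```yaml
# One baseline quota unit (BQ) rendered as a ResourceQuota (sketch).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: baseline-quota
  namespace: team-a           # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "16"        # 1 BQ; a 2-BQ namespace would get "32"
    requests.memory: 64Gi     # assumed per-BQ memory item
    pods: "30"                # 1 BQ; a 2-BQ namespace would get "60"
```

A quota increase then becomes a single multiplier change instead of hand-editing each item.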
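And going back to automatic resource configuration for a moment: ARC is an internal tool, so this is only a sketch of the shape its admission-time half could take if expressed as a Gatekeeper Assign mutation, with a hardcoded value standing in for the request size that ARC actually computes from Prometheus metrics; the opt-in label is hypothetical.

```yaml
# Sketch: mutate the CPU request of opted-in pods at admission time.
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: arc-cpu-request
spec:
  applyTo:
    - groups: [""]
      kinds: ["Pod"]
      versions: ["v1"]
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["*"]
        kinds: ["Pod"]
    labelSelector:
      matchLabels:
        arc.example.com/enabled: "true"   # hypothetical opt-in label
  location: "spec.containers[name:*].resources.requests.cpu"
  parameters:
    assign:
      value: "250m"   # ARC would compute this from historical utilization
```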
At the cluster level, we measure whether a cluster is at capacity using Prometheus alerts. How are we doing this? In Ethos, the source of record for cluster information is an application named Cluster Registry, which, by the way, is open source and available under Adobe's GitHub organization. There is a Cluster Registry client that runs in every cluster and accepts signals from Alertmanager. In Prometheus, we have multiple capacity sub-alerts that fire based on specific metric thresholds: for example, the number of nodes, the number of available IPs that can be assigned to nodes, the number of namespaces, and so on. When one of the sub-alerts fires, the main capacity alert notifies the Cluster Registry client, and the cluster information is updated; for example, namespace onboarding is disabled, or, even more, namespace quota increases are frozen for all the existing namespaces in the cluster. A sketch of such a sub-alert follows below.
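The talk doesn't show the actual alert definitions, but a capacity sub-alert of the kind described might look roughly like the following, assuming the Prometheus operator's PrometheusRule format and the kube_node_info metric from kube-state-metrics; the threshold, labels, and names are made up for illustration.

```yaml
# Sketch of a capacity sub-alert on the node count of a cluster.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-capacity
  namespace: monitoring
spec:
  groups:
    - name: capacity
      rules:
        - alert: NodeCountNearCapacity
          expr: count(kube_node_info) > 900   # illustrative threshold
          for: 15m
          labels:
            severity: warning
            capacity: "true"   # hypothetical label used to roll up sub-alerts
          annotations:
            summary: Cluster is approaching its node-count capacity limit.
```

Similar rules would cover available node IPs, namespace counts, and so on, all rolled up into the main capacity alert that notifies the Cluster Registry client.

Now I'm going to pass it over to Adrian, so we can talk more about governance policies, multi-tenancy at scale, and non-disruptive Kubernetes upgrades.

Thank you, Victor. I would like to continue our talk about multi-tenancy, but tackle it from an infrastructure perspective. I have prepared three topics today, covering reliability and efficiency on one hand, and scalability and security on the other. I will start with governance policies. Just as any company or business is governed by a set of rules, so is a multi-tenant Kubernetes cluster. Why are these policies mandatory, and what benefits do they bring to the Kubernetes ecosystem? From our perspective, along with the security aspect, there are two main advantages to defining a set of rules inside a Kubernetes cluster: first, safeguarding teams against inter-team collisions, and second, protecting cluster stability, so that a single development team cannot jeopardize the entire cluster.

A few years ago, when we initially started building our platform, we had a pretty interesting outage. One day, two distinct teams created two different ingress objects, in two distinct namespaces, but pointing to the same FQDN. It took us a while to realize that these two ingress objects were conflicting with each other, because the ingress controller implemented no validating webhooks at that time, beyond the CRD schema. This was the moment we decided that a set of governance policies was mandatory. In order to implement these policies across our cluster fleet, we picked the OPA Gatekeeper framework. For those of you who are not familiar with it, OPA Gatekeeper is an extensible admission controller that comes with all of the necessary Kubernetes API plumbing already configured, and cluster operators can change the business logic of Gatekeeper by simply writing policies, which are regular Rego queries as short as a few lines; we will see such an example in a bit. Getting back to our outage: after that event, we created a validating ingress policy, which denies the creation or update of any ingress object that attempts to use an FQDN already in use by another existing ingress object.

To name a few other example policies we currently deploy across our cluster fleet: the control plane toleration policy, used to restrict the workloads that can run on the control plane nodes; the CronJob history policy, used to restrict the history kept for a CronJob, so that we are not putting unnecessary load on etcd; the default ingress class policy, used to add an ingress class to all ingress objects that do not explicitly specify the one they want to use; and the namespace limit policy, used to limit the total number of namespaces that can be created inside a cluster.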
There is also the external IP services policy, used to deny the creation of services with external IPs, and many others. As you can see on the slide, we have a Rego snippet implementing the external IP services policy. With only a few lines of code, we were able to define a pretty powerful policy that denies the creation of any service with external IPs. What is it doing? First, it checks whether the object from the request is of type Service. Then it checks whether the operation is create or update. And in the end, it checks whether the object's spec has any external IPs defined. If all three conditions are met, we reject the request and send a message back to the user stating that external IP services are not permitted, because of a pretty serious vulnerability found in the Kubernetes code base (CVE-2020-8554). One thing to keep in mind here is that Gatekeeper, like any other validating or mutating webhook, adds latency to every API request it mutates or validates, simply because of the extra processing time needed. So the more policies you define, the higher the API response latency might be.

Another story is about multi-tenancy at scale. As you saw from the numbers at the beginning of the presentation, we are running at a pretty high scale, and the challenges for such a big platform are diverse. Recently, we switched from our internally developed CI/CD tool to the Argo ecosystem; I guess everyone is already familiar with Argo, we just had ArgoCon a few days ago. During our Argo CD evaluation, one of the first challenges we encountered was that a single Argo CD instance couldn't handle the reconciliation volume needed for our fleet. To give you an idea, we deploy between 70 and 90 admin components per Kubernetes cluster, and with a fleet of more than 300 clusters, the total number of applications that need to be synced is over 24,000, way higher than a single Argo CD instance can handle. So, in order to scale the rollout of all the admin components across the fleet, we came up with a pretty interesting pattern, which we call Argo of Argos. As you can see on the slide, we run a multi-tier Argo CD setup, where the tier 0 Argo CD instance is used to reconcile the tier 1 Argo CD instances, and the tier 1 Argo CD instances are used to reconcile, or sync, the cluster admin components fleet-wide. Moreover, each tier 1 Argo CD instance handles only a subset of the Kubernetes clusters in the fleet. Also, all tier 1 Argo CD instances have the same config and register the same set of ApplicationSets, so that we have consistency across the entire tier 1. In this way, we gained flexibility in scaling the continuous delivery system: we can always add more tier 1 Argo CD instances, or remove some, based on our platform needs. Two sketches follow below: the external IP policy as a Gatekeeper ConstraintTemplate, and the kind of ApplicationSet a tier 1 instance might register.
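To make the earlier walkthrough concrete, here is a minimal sketch of what the external IP policy could look like as a Gatekeeper ConstraintTemplate. This is a reconstruction following the three checks described above, not Adobe's actual policy, and the template and message names are made up.

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenyexternalips
spec:
  crd:
    spec:
      names:
        kind: K8sDenyExternalIPs
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdenyexternalips

        violation[{"msg": msg}] {
          # 1. The object under review is a Service.
          input.review.kind.kind == "Service"
          # 2. The operation is a create or an update.
          ops := {"CREATE", "UPDATE"}
          ops[input.review.operation]
          # 3. The spec declares one or more external IPs.
          count(input.review.object.spec.externalIPs) > 0
          msg := "external IP services are not permitted"
        }
```

A corresponding K8sDenyExternalIPs constraint object, matching Service resources, would then put the template into effect.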
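And for the Argo of Argos pattern, this is a sketch of the kind of ApplicationSet each tier 1 instance might register, one per admin component, fanning out over every cluster assigned to that instance; the repo URL and the component are hypothetical.

```yaml
# Sketch: deploy one admin component to every cluster registered
# with this tier 1 Argo CD instance.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ingress-controller
  namespace: argocd
spec:
  generators:
    - clusters: {}   # one Application per cluster known to this instance
  template:
    metadata:
      name: 'ingress-controller-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/admin-components.git  # hypothetical
        targetRevision: HEAD
        path: ingress-controller
      destination:
        server: '{{server}}'
        namespace: kube-system
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

Because every tier 1 instance registers the same set of ApplicationSets, moving a cluster between tier 1 instances does not change what gets deployed to it.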
And the last story for today is about non-disruptive cluster upgrades. As our platform evolved and started onboarding more and more teams, the diversity of the workloads running on top of it kept increasing. Some teams started running stateful apps, like databases or distributed event-streaming apps, which, as we know, are pretty sensitive to disruptions. After a few outages caused by cluster upgrades, it was clear that we needed a new strategy for cluster upgrades, because pod disruption budgets alone were not enough. We were looking to keep a high enough velocity while rotating the worker nodes, but still maintain the clients' application availability and keep infrastructure costs within reasonable thresholds. So we came up with what we call the parked nodes upgrade strategy. This strategy is implemented around k8s-shredder, a Kubernetes controller developed in-house at Adobe and then open sourced; it is available under the Adobe GitHub org, and you can scan the QR code on the slide to get access to it.

How does our non-disruptive cluster upgrade procedure work, from a high-level perspective? During a full cluster upgrade, we drain in batches a percentage of the total worker nodes at a time, while adding new worker nodes. We also cordon all the existing worker nodes, so that no new pods can be scheduled on them. For the sake of the example, let's assume we have a cluster with two worker nodes, which we are going to upgrade to a newer Kubernetes version. Once the upgrade process begins, as I mentioned earlier, we add a new worker node running the newer version of Kubernetes and then start draining the old nodes. Evicted pods will move to the new worker node, since the old ones were already cordoned at the beginning of the upgrade process. As the upgrade progresses, more pods move to the new node until there is no capacity left on it, and if we run out of resources, we simply spin up new worker nodes to accommodate all the pods evicted by the draining process. If, within the configured drain timeout, not all the pods are evicted, the upgrade process labels the worker node as parked and adds a TTL, or time to live, to it. All the old nodes that were successfully drained and have no running pods on them will be recycled by the cluster autoscaler, or eventually by the upgrade process. And with that, this is the moment we consider the cluster upgrade finished. Once the upgrade is finished, the development teams that are still running pods on parked nodes get notified, so that they can take the necessary measures to move their pods off these parked nodes before the TTL expires.

Once the draining process is finished on all worker nodes, k8s-shredder takes over. What is it doing behind the scenes? First, it identifies all the parked nodes; then, for each of them, it grabs all the running pods, and for each of these pods it runs a set of eviction loops. Initially, it periodically tries to soft evict all the running pods on the parked nodes, while respecting the PDBs. Most of the pods will be successfully soft evicted by k8s-shredder after a few eviction loops, but some of them won't be; still, k8s-shredder keeps monitoring the TTL of the parked node, and if, after the configured parked node TTL, there are still running pods that couldn't be soft evicted, k8s-shredder takes a pretty aggressive measure and just force evicts all those running pods. Once there are no more pods running on a parked node, the cluster autoscaler recycles it. And with that, all worker nodes in the cluster are running the new version of Kubernetes. Putting all these steps together, you can see that the process is pretty smooth, and eventually all worker nodes end up running a newer version of Kubernetes. One of the usual culprits blocking soft eviction is a "bad" PDB, sketched below.
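As an illustration of what such a "bad" PDB looks like (all names are hypothetical): with maxUnavailable set to 0, the eviction API can never make progress, so soft eviction keeps failing until the TTL expires and force eviction kicks in.

```yaml
# A "bad" PDB: it permits zero voluntary disruptions, so drains and
# soft evictions of the selected pods always fail.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: blocking-pdb
  namespace: team-a          # hypothetical tenant namespace
spec:
  maxUnavailable: 0          # no pod may ever be voluntarily evicted
  selector:
    matchLabels:
      app: sensitive-app     # hypothetical workload label
```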
Okay, let's see it in action. We prepared a live demo for today, where we are going to simulate a full cluster upgrade in order to see how k8s-shredder can help us clean up the running pods from a parked node. We already have a cluster running, with one control plane node and two worker nodes, and we are going to park one of these worker nodes. In the upper left terminal, we are going to label and cordon a worker node; in the upper right terminal, we are going to watch the pods running on the node we are going to park, using kubectl get pods and the watch command; and in the bottom terminal, we are going to watch the k8s-shredder logs, so that we can get a feel for what it's doing behind the scenes. Let me restart k8s-shredder so that we have some clean logs. Okay, k8s-shredder started. Let's see the running pods. So we have a bunch of pods running on the node we are going to park: pods coming from a StatefulSet, pods with bad PDBs, a pod that allows eviction, a pod that doesn't allow eviction, and so on. Let's park this worker node. We added a TTL of just one minute, so that we can see a fast iteration of what k8s-shredder is doing behind the scenes. As we can see, k8s-shredder has already reacted: it noticed that there is a parked node in the cluster and started the eviction loops for all the running pods. Many pods were successfully soft evicted during the first iterations, but some of them couldn't be; still, k8s-shredder will periodically try to soft evict them until the TTL of the node expires. These pods couldn't be soft evicted because they have bad PDBs configured, or because the tenant explicitly disallowed eviction. k8s-shredder is also able to perform a rollout restart of the Deployment or StatefulSet behind the running pods. And after one minute, once the TTL expires on this parked node, we can see that k8s-shredder takes that aggressive action and just force evicts all the remaining pods from the parked node. With that, we no longer have any running pods on the parked node, and the cluster autoscaler can chime in and safely recycle the worker node. Okay, that was the demo, and with that I'm going to pass it back to Victor for the conclusions.

Thank you, Adrian, very good demo, and this time it didn't fail. Let's wrap up with a few takeaways from our five-year journey of running Kubernetes in a multi-tenant architecture. There is no silver bullet when building a multi-tenant developer platform; you should always align with your product development teams throughout this process. Every company is different, and each has its own needs and vision regarding multi-tenant architecture, and here Kubernetes namespaces are a feasible way to build the boundaries. And last, but not least: the challenges of working at scale are different compared with a small or medium-sized platform. Thank you for your attention. I think we have time for questions; in any case, we will also be available for the next 10 to 15 minutes for offline questions if you have any. Also, please scan this QR code to give us some feedback. Thank you. So, do you have any questions?

I have one question, on the chargeback model. You have various teams using the shared Kubernetes clusters, right? So how does the chargeback model to these departments work, like the Photoshop team, or the AEM team, and so on? Do you have anything in place? And the second question is: how do you focus on optimization?
Some teams just overprovision, right? Like, they don't set the right resource limits. How do you optimize that?

Okay, yeah. So, for the first question, how we are charging back our users: we're using a solution called Kubecost for this, and with some algorithms we provide them the actual cost of running their application in our namespaces or in our clusters, because we also have users on dedicated clusters who basically use the entire cluster. We also add labels on the namespaces, like a service ID for the team running inside that namespace, so we can easily correlate the pods running in a namespace with a specific team, and that way we can easily charge them back.

Yeah, thank you. We tried that, so I'll definitely explore it. The second question was around the optimization of the resource limits.

Yeah. So for the resource limits, as I showed in the presentation, we have a project called automatic resource configuration. It's an internal one, but we are thinking about open sourcing it. Basically, based on Prometheus metrics, we calculate the right size of the CPU and memory requests that the pods should have. Then we label the deployments, an OPA policy hooks in when the pods are created on the cluster, and at that moment we set the right CPU and memory requests for those pods. One mention related to your question: we are not changing the resource limits, only the resource requests, because scheduling is done based on the resource requests, not on the limits; that's why we can overcommit on worker nodes.

Hi, I had two questions also. One question is: you showed the namespace profile and all the namespaced objects that get set to control things like the quota, the limit range, and the role bindings. I'm assuming the developer teams don't set those, because they constrain what the team can do; so how do those get there in the first place, and how do they get updated when there's a new standard for them?

Yeah, very good question. So, actually, we have an automation: five years ago it was a script, and we provisioned namespaces using Jira tickets; then we turned it into an automation with an API. Those profiles are static and controlled by us, and they are deployed on the cluster by the user. So it's a self-service mechanism to deploy the namespace profile in the cluster, but the users don't actually control the profile; we control it, and depending on what we change in the profile, we even update the profile on the cluster ourselves. But if it's something that can impact an application, like network policies, we delegate the update to the end user.

Yeah, and one addition here is that tenants are not talking directly to the API server when they want to create a namespace; they are talking to our Ethos Kubernetes onboarder application, and that application talks to the API server when creating the new namespace. That tool also adds the namespace profile, all the network policies, and all that stuff.

Okay, and the other question is: you mentioned Cluster API at the beginning, but where does Cluster API fit into this whole picture?

It doesn't fit in our presentation, but it's a milestone we wanted to mention: on the infrastructure side, we actually adopted Cluster API, and also Argo, for building and managing Kubernetes clusters.

Is one of the Argos in that two-tier picture a part of that infrastructure piece?

Yeah. Okay.
Thanks. We can take the discussion offline, because we are out of time. Thank you very much again.