Hello everyone, thank you for joining this session; very good talks so far. It's actually the first time we are at this conference in Shanghai, and it's impressive. My name is Victor Varza, I'm a technical lead, and together with my colleague Adrian we work at Adobe, in the Bucharest office in Romania, on Adobe's Kubernetes platform, named ETHOS. We are both passionate about open source contributions. I am one of the organizers of Kubernetes Community Days, or KCD, in Romania; it will be the first KCD event in southeast Europe and it will be held next year in April. Adrian, do you want to introduce yourself?

Yeah, sure. Hello everyone, my name is Adrian, I'm a lead cloud software engineer at Adobe. I'm also part of the Kubernetes GitHub organization, and currently I'm focusing on the Cluster API ecosystem project.

Today we are going to talk about our five-year journey of leveraging namespaces to scale the adoption of Kubernetes at Adobe. In the first part of the presentation I'm going to talk about project ETHOS, Kubernetes namespaces, and capacity management, and my colleague Adrian will continue with governance policies, non-disruptive Kubernetes upgrades, and multi-tenancy at scale.

The project we are working on is named ETHOS, and it is a multi-tenant Kubernetes-based platform established through collaboration between Adobe's infrastructure team and the product or application development teams. The initial version of ETHOS began with Apache Mesos and DC/OS, and by 2016 we were in production. It was a good decision at that moment, because DC/OS was a production-ready solution and we gained experience running containerized applications and microservices in production. Then, as Kubernetes matured, we initiated the migration from DC/OS to Kubernetes; we started in 2018, and one year later we almost hit 100 Kubernetes clusters. Along with the Kubernetes project itself, ETHOS uses many other open source projects such as Cilium, Prometheus, Open Policy Agent, Argo, and so on.

This is the high-level overview of the ETHOS Kubernetes platform and its position in Adobe. At the top of the slide we have the three Adobe clouds: Creative Cloud, Experience Cloud, and Document Cloud. These clouds are powered by Adobe products such as Adobe Photoshop, Adobe Firefly, Analytics, Adobe Experience Manager, Adobe Sign, and so on. These products, together with the platforms they use, such as Sensei machine learning, Content Platform, and Experience Platform, all run on top of ETHOS. ETHOS is basically the Adobe runtime for containerized applications. At the bottom of the slide we have the three cloud providers on which ETHOS operates: AWS, Azure, and the Adobe private cloud.

To better understand the platform's scale, let's go through some pretty impressive numbers. ETHOS hosts more than 2.1 million containers encapsulated in almost 1 million pods. Those pods run in 40,000 tenant namespaces, that is, namespaces owned by application development teams. We manage more than 300 Kubernetes clusters deployed across 28 different cloud regions in AWS, Azure, and the Adobe private cloud. In terms of computing power, these workloads use around 32,000 virtual machines, or Kubernetes nodes, consuming approximately 2.7 petabytes of RAM and 750,000 virtual CPUs. Additionally, the AI applications use around 2,000 GPUs.

Let's talk about multi-tenancy in Kubernetes.
How many of you are using Kubernetes in a multi-tenant architecture? We have a good audience. There are actually many definitions of multi-tenancy. At Adobe, we learned that multi-tenancy in Kubernetes is a way to share multiple physical clusters among multiple teams from different projects or different organizations.

We have two types of clusters: shared clusters and dedicated clusters, also known as multi-tenant clusters and single-tenant clusters. Shared clusters are available to any internal Adobe team, and they are highly valuable for optimizing cost and enhancing the overall platform reliability. Dedicated clusters are used for two purposes. The first is when high security isolation is required, such as for applications that run untrusted software; an example is Adobe Experience Manager, a content management system capable of running software written by Adobe customers. The other scenario is when a specific team requires a high resource allocation for an application that basically uses the entire cluster's resources. For instance Adobe Firefly, a generative-AI-powered content creation solution, consumes essentially all of a cluster's GPUs and CPUs.

To implement multi-tenancy, we rely on Kubernetes namespaces as virtual Kubernetes clusters. Developers love namespaces because they provide flexibility, more control, and easier troubleshooting of applications. In ETHOS, each namespace name is unique across the entire fleet, and it is generated based on a namespace profile template.

A namespace profile template is primarily composed of Kubernetes primitives that provide a minimum level of isolation within a cluster. First of all, we need a Kubernetes Namespace to group the objects of a single team in the Kubernetes API. Then a RoleBinding is used to implement an authorization mechanism for the namespace, ensuring that only a specific team has access to that particular namespace. A ResourceQuota and a LimitRange play a crucial role in controlling and limiting resource consumption, ensuring fair resource allocation within the cluster. For network isolation we use both Kubernetes native network policies and Cilium network policies; the Cilium ones are very useful for implementing DNS-based policies and other layer 7 policies. After the namespace profile is deployed on the cluster, the tenant can deploy their application inside it. The tenant's application resources are restricted to that specific team, and the pods are isolated in the cluster by the default network policies.
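To make the namespace profile concrete, here is a minimal sketch of what such a profile could contain; all names and values are illustrative, not the actual ETHOS template:

```yaml
# Hypothetical namespace profile for a tenant team "team-a".
apiVersion: v1
kind: Namespace
metadata:
  name: ethos-team-a              # namespace names are unique fleet-wide
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding                 # only team-a gets access to this namespace
metadata:
  name: team-a-admins
  namespace: ethos-team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin                     # namespaced admin permissions only
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: team-a                  # the owning team's identity group
---
apiVersion: v1
kind: ResourceQuota               # caps the team's resource consumption
metadata:
  name: team-a-quota
  namespace: ethos-team-a
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    pods: "30"
---
apiVersion: v1
kind: LimitRange                  # sane defaults for containers that omit resources
metadata:
  name: team-a-limits
  namespace: ethos-team-a
spec:
  limits:
    - type: Container
      defaultRequest: {cpu: 100m, memory: 128Mi}
      default: {cpu: 500m, memory: 512Mi}
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy               # default isolation: only in-namespace traffic
metadata:
  name: default-allow-same-namespace
  namespace: ethos-team-a
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector: {}         # peers in the same namespace only
```

The default network policy is what keeps tenant pods reachable only from within their own namespace unless traffic is explicitly opened up.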
In a multi-tenant environment, capacity management is a key consideration, because capacity issues can result in higher costs. By the way, who doesn't have cost concerns when running applications in the cloud? Costs are very important. We take action at three levels. At the pod level, in addition to horizontal pod autoscaling and vertical pod autoscaling, we use a concept known as automatic resource configuration, or ARC. At the namespace level, we simplify quota management using a concept named BQ, or baseline quota unit. And at the cluster level, we added capacity alerts.

Let's discuss automatic resource configuration. We know that in Kubernetes, pods are scheduled on worker nodes based on their containers' resource requests, and they can burst up to a specified limit. So if the resource requests are lower, smaller allocations are reserved for the pod, which allows more pods to be scheduled per worker node, resulting in cost savings. To achieve this, we rely on Prometheus metrics to get the historical utilization data for a tenant's deployment pods. Then an OPA (Open Policy Agent) policy adjusts the CPU and memory requests of those pods to the right size.

At the namespace level, to simplify quota management operations, we introduced the concept of the BQ, or baseline quota unit. A BQ is a quota definition, and every increase of a namespace's quota is achieved by multiplying each of the BQ items. For example, we have a BQ definition here. If we want to allocate 160 vCPUs to our namespace instead of 16, we simply increase the quota for that namespace from one to ten BQs, and all the other BQ items are multiplied as well, so we also gain the ability to run 300 pods in the namespace. This approach simplifies quota management operations both for tenants and for the cluster administrators.

What happens when a cluster is at capacity, and what does that even mean? To determine that a cluster is at capacity, we have multiple sub-alerts that fire based on specific metrics, such as the number of nodes or the number of available IPs that can be allocated to worker nodes. When one of the sub-alerts fires, the main capacity alert notifies our automation so that no new namespaces can be created on the cluster and, going further, no quota increase operations can be performed on the existing namespaces.

Now I'm going to pass it over to Adrian so he can talk more about governance policies, non-disruptive Kubernetes upgrades, and multi-tenancy at scale.

Thank you, Victor. I would like to continue our talk about multi-tenancy but tackle it from an infrastructure perspective. I prepared three topics, covering reliability and efficiency on one hand and security and scalability on the other, and I will start with governance policies.

Just as any company or business is governed by a set of rules, so is a multi-tenant Kubernetes cluster. Why are these policies mandatory, and what benefits do they bring to the Kubernetes ecosystem? From our perspective, along with the security aspect, there are two main advantages to defining a set of rules inside a Kubernetes cluster. The first is safeguarding development teams, so that they cannot create collisions between them. The second is protecting cluster stability, so that a development team cannot jeopardize the cluster.

A few years ago, when we started building our platform, we had a pretty interesting outage. One day, two distinct teams created two different ingress objects in two different namespaces, but pointing to the same FQDN. It took us a while to realize that these two ingress objects were conflicting with each other, because the ingress controller implemented no validating webhooks beyond the CRD schema. That was the moment we decided that a set of governance policies was mandatory.

To implement these policies across our cluster fleet, we picked the OPA Gatekeeper framework. OPA Gatekeeper, for those of you who are not familiar with it, is an extensible admission controller that comes already configured with all the necessary Kubernetes API plumbing. Cluster operators change the business logic of Gatekeeper simply by writing policies, which are Rego queries as short as a few lines. We will see such an example in a bit.
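Here is, in fact, a hedged reconstruction of that example, the external IP services policy described shortly, written as a Gatekeeper ConstraintTemplate; the names and the message text are illustrative, not the exact slide code, and the vulnerability it refers to is presumably CVE-2020-8554:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: denyexternalips
spec:
  crd:
    spec:
      names:
        kind: DenyExternalIPs     # the Constraint kind this template generates
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package denyexternalips

        violation[{"msg": msg}] {
          # 1. The object under review is a Service.
          input.review.kind.kind == "Service"
          # 2. The operation is a create or an update.
          ops := {"CREATE", "UPDATE"}
          ops[input.review.operation]
          # 3. The service spec defines at least one external IP.
          count(input.review.object.spec.externalIPs) > 0
          msg := "services with externalIPs are not allowed"
        }
```

A matching Constraint of kind DenyExternalIPs, scoped to Service objects, would then activate the policy cluster-wide.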
And so, getting back to our outage: after that moment we created a validating ingress policy, which denies the creation or update of any ingress object that attempts to use an FQDN already in use by another existing ingress object.

Some other example policies we currently deploy across our cluster fleet: I would mention the control plane toleration policy, used to restrict which workloads can run on the control plane nodes. Another policy, cronjob history, is used to limit the history kept for a cronjob so that we are not putting unnecessary load on the etcd side. Default ingress class is another interesting policy, used to add an ingress class to all ingress objects that do not explicitly specify which ingress class they want to use. Namespace limit is a policy used to cap the total number of namespaces that can be created inside a Kubernetes cluster. And external IP services is a policy used to deny the creation of services with external IPs.

As you can see on the slide, we have a Rego snippet implementing the external IP services policy, and with only a few lines of code we were able to define a pretty strong policy that denies the creation of an external IP service. What it does: first it checks whether the object in the request is of type Service, then it checks whether the operation is a create or an update, and finally it checks whether the service spec has any external IPs defined. If all three conditions are met, we reject the request and send back a message to the user stating that external IP services are not allowed, because of a pretty serious vulnerability found in the Kubernetes codebase.

One thing to keep in mind here is that Gatekeeper, like any other validating or mutating webhook, adds latency to any API request it mutates or validates, simply because of the extra processing time needed. So the more policies you define, the higher the API response latency might be.

Another story is around cluster upgrades. As our platform evolved and started onboarding more and more teams, the diversity of the workloads running on top kept increasing. Some teams started running stateful apps, like databases or distributed event streaming applications, that are pretty sensitive to disruptions. After a few outages caused by our cluster upgrade process, it was clear we needed a new strategy for cluster upgrades, because pod disruption budgets alone were not enough. We were looking to keep a high enough velocity while rotating the worker nodes on one hand, but still maintain the tenants' application availability and keep infrastructure costs at reasonable thresholds on the other.

And so we came up with what we call the parked nodes upgrade strategy. This strategy is implemented around k8s-shredder, a Kubernetes controller developed in-house at Adobe and then open sourced. It is available under the Adobe GitHub org, and you can scan the QR code on the slide to get access to it.

How does our non-disruptive cluster upgrade procedure work, from a high-level perspective? During a full cluster upgrade, we drain a percentage of the total worker nodes at a time, in batches, while adding new worker nodes. We also cordon all the existing worker nodes so that no new pods can be scheduled on them.
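For reference, cordoning is just a flag on the node object, which is what `kubectl cordon` sets under the hood; the node name here is illustrative:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-old-1              # illustrative node name
spec:
  unschedulable: true             # the scheduler places no new pods here
```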
For the sake of the example, let's assume we have a cluster with two worker nodes that we are going to upgrade to a new Kubernetes version. Once the upgrade process begins, as I mentioned earlier, we add a new worker running the newer version of Kubernetes and then start draining the old nodes. Evicted pods move to the new worker nodes, since the old nodes were already cordoned at the beginning of the upgrade process. As the upgrade progresses, more pods move to the new nodes until eventually there is no capacity left on them, and if we run out of resources, we simply spin up new worker nodes so that we can accommodate all the pods evicted by the draining process.

If, within the configured drain timeout, not all the pods are evicted, we then label the worker node as parked and attach a TTL, or time to live, to it. All the old nodes that were successfully drained and have no running pods left on them are recycled by the cluster autoscaler, or eventually by our upgrade process. And with that, this is the moment we consider the cluster upgrade finished. Once the upgrade is finished, development teams that still have pods running on parked nodes get notified, so that they can take the necessary measures to move their pods off the parked nodes before the node's TTL expires.

Once the draining process is finished on all worker nodes, k8s-shredder takes over. What does it do behind the scenes? It first identifies all the parked nodes in the cluster; then, for each of these parked nodes, it grabs all the running pods, and for every pod it runs a set of eviction loops. Initially it periodically tries to soft evict all the pods running on parked nodes, and most pods are successfully soft evicted by k8s-shredder after a few iterations, a few eviction loops. Some of them, however, cannot be soft evicted, and k8s-shredder keeps monitoring the TTL of the parked node. If, after the configured parked-node TTL, there are still running pods that could not be soft evicted, then k8s-shredder takes a pretty disruptive action: it force evicts all the pods still running on the parked node. Once there are no more pods running on the parked node, the cluster autoscaler recycles those nodes.

Putting all these steps together, you can see that the process is pretty smooth, and eventually all worker nodes are running the newer version of Kubernetes.
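To make the parked state concrete, a parked node might be marked roughly like this; the exact label and annotation keys below are illustrative assumptions, not necessarily the ones k8s-shredder ships with:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-old-1
  labels:
    # Assumed key: marks the node as parked so k8s-shredder watches it.
    shredder.ethos.adobe.net/upgrade-status: parked
  annotations:
    # Assumed key: when this TTL expires, remaining pods are force evicted.
    shredder.ethos.adobe.net/parked-node-expires-on: "1700000000"
spec:
  unschedulable: true             # parked nodes stay cordoned
```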
And the last story for today is about multi-tenancy at scale. As you saw from the numbers at the beginning of the presentation, you can imagine we are running at a pretty high scale, and the challenges for such a big platform are diverse. Recently we switched from our internally developed CI/CD tool to the Argo CD ecosystem; I guess every one of you is familiar with Argo. During our Argo CD evaluation, one of the first challenges we encountered was that a single Argo CD instance couldn't handle the reconciliation volume needed for our fleet. To give you an idea, we deploy between 70 and 90 admin components per Kubernetes cluster, and with a fleet of more than 300 clusters, the total number of applications that need to be synced is over 24,000, way higher than a single Argo CD instance can handle.

And so, in order to scale the rollout of all the admin components across the fleet, we came up with a pretty interesting pattern, which we called Argo of Argos. As you can see on the slide, we run a multi-tier Argo CD setup, where tier 0 is used to reconcile the tier 1 Argo CD instances, and the tier 1 Argo CD instances are used to sync the cluster admin components fleet-wide. Moreover, each tier 1 Argo CD instance handles only a subset of the Kubernetes clusters in the fleet. Also, all tier 1 Argo CD instances have the same config and have the same set of application sets registered, so that we have consistency across all of them. In this way we gained flexibility in scaling the continuous delivery system: we can always add more tier 1 Argo CD instances, or remove them, based on our platform's needs.
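A tier 1 instance could stamp out the admin components for every cluster registered to it with a single ApplicationSet using the cluster generator; this is a sketch with an assumed repo layout, not our actual config:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: cluster-admin-components
  namespace: argocd
spec:
  generators:
    - clusters: {}                # one Application per cluster registered to this tier 1
  template:
    metadata:
      name: '{{name}}-admin-components'
    spec:
      project: default
      source:
        repoURL: https://git.example.com/ethos/admin-components.git  # hypothetical repo
        targetRevision: main
        path: manifests
      destination:
        server: '{{server}}'      # the generated cluster's API endpoint
        namespace: kube-system
      syncPolicy:
        automated: {}             # keep the components continuously synced
```

Sharding then falls out of cluster registration: each tier 1 instance only generates Applications for its own subset of clusters, while tier 0 keeps the tier 1 instances themselves in sync.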
And with that, I'm passing it back to Victor for the conclusions.

Okay, so let's wrap up with a few takeaways from our five-year journey of running Kubernetes in a multi-tenant architecture. There is no silver bullet when building a multi-tenant developer platform: you should always align with your target audience, your users, or, in our case, the product development teams. Every company is different and has its own needs and vision regarding the multi-tenancy architecture. Namespaces are a viable solution for building boundaries around multi-tenancy. And last but not least, the challenges of working at scale are different compared with a small or medium-sized platform.

Thank you so much, we hope you enjoyed our talk. If you want to get in touch with us, here are our contacts. Additionally, you may be wondering where the k8s-shredder icon came from. As you can see on the slide, Adrian excels as a fisherman; on the other hand, he is one of the worst sailors. Thank you. Any questions?

Sorry, I didn't catch one of the slides very clearly, the one about the parked node. Could you please explain again what a parked node is, and what happens if the TTL listed there expires?

Adrian, do you want to take it? I can take it. So the TTL, or time to live, the timestamp of a node, is added at the node level, and k8s-shredder periodically evaluates the TTL of the node. If the TTL of the node expires, then k8s-shredder force evicts all the pods that are still running on that node after the TTL expired. Force evicting them means it does not take into account the PDB of the deployment those pods belong to; that's why it's a pretty disruptive action.

So, can I understand that pod Z1 is a non-disruptable pod, that it must stay on the original node of Z1? Yeah, but the trick here is that once the node is parked, all the users running pods on that parked node get notifications, so they will have maybe a few days to move their pods off the parked node. Okay, got it. Thank you. Yeah, one more thing here: k8s-shredder is an open source project, and you can try it.

Another question? I might have missed it at the beginning. I understand that once you've created the namespace for the user, you pretty much give them a quota, but then they can do whatever they like. So who is responsible for how they set up CI/CD, or any of the other best practices, inside the namespace?

Yeah, actually, we have many teams at Adobe, and some of them want to use their own CI/CD; teams that joined via acquisitions use their own CI/CD, like Jenkins or Spinnaker. We also offer a CI/CD solution as part of the platform. But for these users we provide the namespace as a service, and basically the user is responsible for what they put there. Cool, thank you.

Another question? I have a question about the non-disruptive upgrades. You just covered upgrading the Kubernetes worker nodes. What about the API server and the controller manager, that part? Could you please repeat the question? Okay, on that page about upgrading: how about upgrading the Kubernetes control plane? Oh, okay. So at the control plane level, given that we are in control of those nodes, there is no need for a parked node strategy there, because we have full control over those nodes and we, as cluster operators, can always manually intervene. So the parked node upgrade strategy applies only on the worker node side, not on the control plane side. Okay, okay.

Okay, other questions? Okay, so thank you so much for your attention and for your questions, very good questions. If you want to get in touch with us, we'll be around; you can also ping us by mail or LinkedIn. Thank you.