Okay, hello everyone, thank you for joining this session. My name is Victor Varsa, I'm a technical leader at Adobe, and together with my colleague Daniel we're going to talk about multi-tenant versus micro clusters, based on our experience of running Kubernetes at Adobe. A bit about me: I'm part of the Developer Productivity organization at Adobe, whose mission is to help our customers, the developers, write better software even faster. I'm passionate about open source contributions, and I'm one of the authors of k8s-shredder and Cluster Registry, two open source projects that we successfully integrated into our platform. I'm also one of the organizers of KCD Romania, the first in-person KCD event in southeastern Europe, which is happening next month. Daniel, do you want to introduce yourself and kick off the presentation?

Hello everybody, thank you, Victor. My name is Daniel Koman. I'm a lead engineer in Adobe Experience Platform, a long-time Kube follower and a first-time speaker, so be a bit kind to me. To understand the journey with Kubernetes tenancy at Adobe, I'm going to talk about how we started with micro clusters in Adobe Experience Platform, and Victor will explain how we implemented multi-tenancy in Project Ethos, Adobe's unified compute platform.

First of all, what is Adobe Experience Platform, or AEP for short? It's a centralized and connected data foundation that powers customer experience management across Adobe Real-Time CDP, Customer Journey Analytics, and Journey Optimizer. It provides tools for data ingestion and data governance with AI-driven insights, and it integrates with other Adobe products to offer a seamless experience for customer data management and marketing automation.

For AEP we use a hub-and-spoke model, and for this presentation I'm going to focus on the edge component outlined here. It seems quite small on this chart, but if you take a broader look you can see that what we call the edge network is quite large. And here is where our story begins: we were faced with the challenges of rapid regional expansion and needed to quickly spin up edge infrastructure. This was a few years back, when Kubernetes was a relatively new technology among our developers. Project Ethos was just starting (Victor will tell us more about that), but in short, they were focused on our hub locations in Azure, while for the edge network the strategy was to build infrastructure in AWS. So if we wanted to iterate quickly, we needed to do it independently of their efforts at that time.

We started this project with a team of five people, some ambitious goals, and tight deadlines: build seven regional locations in AWS, build a new infrastructure stack based on code automation, and, quite importantly, help edge applications migrate to Kubernetes. As if those tasks weren't enough, we were required to be in production in six months. We knew application teams had limited exposure to the Kubernetes ecosystem, and we had limited manpower. So we needed to focus on a few important things: helping teams understand Kubernetes and its advantages, helping with application migrations, providing a streamlined onboarding experience, and something we call a batteries-included, out-of-the-box experience. Our understanding was that what mattered most was the developer experience. Still, infrastructure provisioning remained a crucial part of our work.
We sought some shortcuts to avoid spending effort on low-level resource provisioning and maintenance. Managed solutions obviously fit our needs, and within AWS we opted for EKS; Adobe was among the first to use EKS at scale, back in 2018, I think, when it first became generally available. We began our work on laying out the infrastructure, and at the same time we started looking at which teams and services would be using this platform. These early discovery talks are a vital process. This is where we figured out that if we wanted teams to embrace Kubernetes and succeed, we needed to give them some assurances, build up trust in the platform, incubate it as a compute layer, and help them get some hands-on experience.

That's why we decided to use this topology: single-tenant, or micro clusters, as we call them. It gave us some short-term advantages, like predictable performance (single tenancy gives better isolation from noisy neighbors) and improved stability. We also benefited from a simplified support matrix, since it's much easier to customize application-specific configurations. And it gave us simplified governance: managing access control, permissions, and policies is more straightforward in a single-tenant cluster, with no need to navigate complex multi-tenancy configurations. All of this allowed us to rapidly build up infrastructure and iterate. So we launched in production six months later, with a shared responsibility model in which application teams had full Kubernetes admin access to their cluster.

This worked quite well in the beginning, when teams needed to onboard their applications to Kubernetes for the first time. Having that level of freedom, with full control over a cluster, made the process a lot simpler: the complicated onboarding process was easy to customize and test, teams had access to debug the whole stack, and they were learning the ropes of how things work in Kubernetes. But nothing good lasts forever. We wanted to discourage teams from modifying infrastructure applications and from changing configurations manually. Very importantly, they were running on unoptimized compute, so over-provisioning of resources was quite a problem. And in some cases, having too much freedom can have catastrophic consequences: we had a notable incident where one of our teams accidentally deleted all the deployments on a cluster because of a bug in their CI/CD automation. We were learning some hard lessons here.

So, from an open garden, we started to close things up. First we implemented access control, limiting access to tenant-only namespaces; for this to work, we also had to pre-provision namespaces. Then we looked at our costs, and we didn't like what we saw, so to enforce proper capacity planning and encourage responsible resource usage, we implemented quotas. Then security came knocking on our door, and we had to implement strict network policies. To further combat some misbehaving tenants, we added policy-based admission controls. As you can see, the line between what a single-tenant cluster means and what a multi-tenant one means started to blur at this point.
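As a hedged illustration of the controls just described (the talk shows no actual manifests), a minimal sketch of a per-tenant quota plus a namespace-scoped network policy could look like the following; the team-a namespace and every value here are assumptions, not Adobe's real configuration:

```yaml
# Illustrative only: per-tenant quota capping aggregate resource usage.
# The "team-a" namespace and all numbers are hypothetical.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    limits.cpu: "80"
    limits.memory: 160Gi
    pods: "200"
---
# Illustrative only: deny ingress from other namespaces while still
# letting pods inside the tenant namespace talk to each other.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: team-a
spec:
  podSelector: {}           # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # only pods from this same namespace
```

An empty podSelector matches all pods in the namespace, so a policy like this keeps intra-namespace traffic working by default while cross-namespace ingress stays blocked, which is one way an "open garden" can be closed without breaking existing workloads.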
In the end, this is how the edge architecture ended up looking. This is one of our edge locations, with one service per micro cluster; within one of these clusters there can be multiple microservices composing the bigger service. I have to mention two notable services that we helped migrate to Kubernetes. The first one is stateless: it's called Conductor, and it's the API gateway aggregation layer used in AEP. The second one is Pipeline, our multi-region distributed transport queue based on Kafka; this one proved to be one of the most difficult services we had to migrate to Kubernetes. And this is the broad overview of the edge network and the hub. That was our journey with the single-tenant micro cluster approach. Next, Adobe decided to consolidate on a single platform, called Ethos, and at this point I'll let Victor pick up the presentation.

Thank you, Daniel. Let's talk about multi-tenancy. I think everyone in this room knows what multi-tenancy is. How many of you are living in a multi-tenant city? Everyone, right? We can draw an analogy with big cities, where we share the same infrastructure: buildings, transportation, health services, security services, and so on. We can do the same with applications running on Kubernetes and imagine Kubernetes as a big city where multiple tenants consume and run different services. At Adobe, developers run their applications on the runtime platform named Ethos, which is a multi-tenant, Kubernetes-based architecture. Along with Kubernetes, Ethos uses many other open source projects, such as Cilium, Prometheus, OPA Gatekeeper, Argo, Helm, and so on.

This is the Ethos Kubernetes platform from 10,000 feet, and how it sits within Adobe. At the top of the slide we have the three main Adobe clouds: Creative Cloud, Experience Cloud, and Document Cloud. These clouds are powered by Adobe software products such as Adobe Photoshop, Adobe Firefly, Adobe Analytics, Adobe Experience Manager, Adobe Sign, and so on. Together with the platforms they rely on, such as Sensei machine learning, Adobe Content Platform, and Adobe Experience Platform, all of these run on top of Ethos, and Ethos is basically Adobe's runtime for containerized applications.

To better understand the scale, here are some pretty impressive numbers. Ethos operates on three main cloud providers (Adobe Private Cloud, AWS, and Azure), spanning 28 different cloud regions. We manage more than 340 Kubernetes clusters hosting more than one million pods, which run in 42,000 Kubernetes namespaces owned by application development teams. In terms of computing power, the platform consumes about 3.9 petabytes of RAM, more than one million virtual CPUs, and tens of thousands of GPUs for AI workloads.

The multi-tenancy approach at Adobe is very simple: we use a multi-tenant architecture as a way to share physical clusters among multiple teams from different projects and different organizations. We have two types of clusters, shared clusters and dedicated clusters. Shared clusters are available to any internal Adobe team and are highly valuable for optimizing cost and enhancing overall platform reliability. Dedicated clusters, which grew out of the micro cluster idea, are used for two main purposes. The first is when high security isolation is required, such as for applications that run untrusted software; for instance, Adobe Experience Manager is a content management system solution that needs to run software written by Adobe customers. The other scenario is when a specific team has high resource demands and customizations for its workload and needs an entire cluster; an example is Adobe Experience Platform, which needs to run highly scalable stateful apps such as the Kafka event streaming platform.
Whether we are referring to a multi-tenant or a single-tenant cluster, the workload isolation on Ethos is the same. We rely on Kubernetes namespaces and a couple of Kubernetes objects to provide a minimum level of isolation within the cluster. For this we use the concept of a namespace profile template, which we pre-define and control, and which is made up of a few Kubernetes objects: a Kubernetes namespace, to group the objects of a single team within the Kubernetes API; cluster roles and role bindings, to ensure that only a specific team has access to a particular namespace; quota and limit range objects, for controlling and limiting resource consumption and ensuring fair resource distribution within a cluster; and network policies and Cilium network policies, for network isolation. The reason we use Cilium network policies is that we need some DNS-based policies and other layer 7 policies by default.

The onboarding process for the Ethos platform is pretty easy. The user just needs to specify some custom values, such as the namespace name, the admin, edit, and view LDAP groups, the clusters where the namespace is going to be deployed, and the namespace profile template. The template is then rendered and deployed on the clusters, and the tenant can deploy their application inside. The tenant's Kubernetes objects are restricted to that specific team, and the pods are isolated by the default network policies. From the developer's perspective, the multi-tenancy architecture is just an isolated namespace where they can deploy their application.
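As a hedged sketch of what a rendered namespace profile template might contain (the talk doesn't show the real template, so the namespace name, LDAP group, and the permissive DNS rule below are all assumptions), the rendered output could combine objects along these lines, with a quota and limit range like the ones sketched earlier sitting alongside them:

```yaml
# Illustrative rendering of a namespace profile template; all names and
# group values are hypothetical, not Adobe's actual template.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
# Scope the team's admin LDAP group to this namespace via the built-in
# "admin" ClusterRole, so only that team can manage objects here.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-admin
  namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: TEAM-A-ADMINS       # hypothetical LDAP group name
---
# Cilium policy granting DNS egress to kube-dns with a layer 7 DNS rule,
# the kind of DNS-based default mentioned above.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-allow-dns
  namespace: team-a
spec:
  endpointSelector: {}
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"   # a real template would likely be stricter
```

Because the template is rendered per tenant from a few input values (namespace name, LDAP groups, target clusters), every namespace lands with the same baseline isolation regardless of which cluster it is deployed to.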
But there is an overhead that comes with multi-tenancy, which is not necessarily visible to the end users. First, we had to implement the namespace isolation with the namespace profile template concept. Then we need governance policies at the cluster level, to safeguard teams against inter-namespace collisions and to ensure the stability of the clusters by preventing one team from compromising an entire cluster. To implement these policies across the cluster fleet, we use the OPA Gatekeeper framework. A basic example of such a policy is validating that ingress objects use a unique FQDN within a cluster.

Capacity management is a key consideration in multi-tenancy, because capacity issues may result in higher costs. We measure capacity at the namespace level using the Kubernetes quota object, and at the cluster level using Prometheus alerts. We added capacity alerts in Prometheus that automatically stop onboarding when a cluster reaches its capacity in terms of number of nodes, number of available IPs that can be assigned to worker nodes, number of ingress objects, and so on.

Another point that needs to be considered in multi-tenancy is non-destructive infrastructure changes. Our platform evolved over time, and as we onboarded more and more teams, the workload diversity expanded; we now host sensitive stateful apps such as databases and Kafka. We had some outages in the past, caused by cluster upgrades, that made us conclude that pod disruption budgets alone are not enough. We aim to rotate the worker nodes while ensuring the availability of tenant applications and keeping infrastructure costs within reasonable limits. So we came up with an open source solution named k8s-shredder, which implements a node-parking upgrade strategy. If you are interested in learning more about this solution, I recommend having a look on GitHub under Adobe's organization.

A multi-tenant Kubernetes-based platform should be even more cost-efficient, so we analyze whether the workloads running in tenant namespaces comply with Ethos standards that could impact the cost, reliability, and security of the overall platform. For example, regarding pod disruption budgets, we've seen many deployments with no PDB associated or, at the opposite extreme, with the PDB set to zero disruptions allowed. Another example is misconfigured liveness and readiness probes; you'd be surprised, but not all developers add the right liveness and readiness probes to their applications. We also look at the CPU and memory request settings, and we make sure that pods have right-sized CPU and memory requests based on their historical usage data. These are the points that make up the overall multi-tenancy overhead, which can be complex to set up at scale. And now I'm going to pass it back to Daniel for the conclusions.

So, a few takeaways from our journey running multi-tenant and single-tenant clusters at Adobe. There is no silver bullet for building a Kubernetes platform; you should always align with your organization's needs and specific requirements. It's quite essential to acknowledge that the challenges faced when operating at scale can differ significantly from those encountered in small and medium-sized platforms. For us, both solutions worked, but we gained efficiency in cost and scalability from running multi-tenancy at scale; on the other hand, micro clusters with single tenancy can help your organization grow faster, because they provide better customization opportunities and are simply easier to set up. So thank you for attending our presentation. If you have any feedback for us, please scan the QR code, and we are open to any questions.

Okay, so thank you for the talk. Have you seen, when you moved to shared clusters, situations where, because pods end up on the same nodes, the network bandwidth is affected? Even though the pods might be using a low number of cores, the shared network becomes an issue for them.

Thank you for the question. So the question is whether we've seen performance issues when moving from single-tenant to multi-tenant clusters. Well, before teams move from single-tenant to multi-tenant, they do some benchmarking; they test their applications. So it depends. Some teams remain on a single-tenant cluster, because we still provide a single-tenant offering, but they can also move to multi-tenant if, for example, they don't have spikes that could cause performance noise for the other tenants. I can also say that yes, we've seen some issues when we talk about IO, network IO especially, which is hard to provide quotas for. And we still have Kafka deployed in a single-tenant configuration.

Hey, thank you for the presentation, that was very interesting. You said you started the team with five people, and now you are managing 340 clusters. How big is your team right now? And I have a second question: did you study the node pool solution, segregating the workload not onto different clusters but just onto different nodes? Thank you.

Thank you for the questions.
So the project with five people was actually Daniel's project, where they started to iterate very fast using only EKS and single-tenant, or micro, clusters. Then, in order to scale, they migrated their solution to the Ethos platform, which is the standard runtime platform at Adobe for running containers, let's say. So yeah, it was a different approach: they just started with a small team to iterate fast. Does that answer the first one?

So how many people are running Ethos right now? How many engineers? More than five, of course. I think around 20 to 50, something like that. Okay, yeah, so roughly ten times more than we were back then. In my company we're managing about five clusters and we are 10 people, so I was thinking, that's very good, five people for that many clusters. It was hard, but that's why single tenancy worked for us; I mean, it was easier than what these guys did.

And for the second question, whether we use dedicated nodes within a cluster for some tenants: yes, we provide this solution, but not for a single tenant; rather for a feature, most likely. For example, a team wants to run on GPUs, we have nodes with GPUs, and only pods that require GPUs can run on those nodes. So we don't do multi-tenant clusters with tenants isolated on dedicated nodes; if they need strict isolation, they just move to a single-tenant cluster. Thank you.

Another thing about node pools: we are not only thinking about compute power here. You have the Kubernetes control plane, and in a multi-tenant cluster you have a lot more objects putting pressure on it, on etcd and other components. So that's another reason why, even if you don't have a large number of nodes, a workload might be better suited to a single-tenant cluster.

Hi, over here. Thanks for your presentation, first of all, it was very interesting. You mentioned that you still have shared clusters as well as dedicated clusters, and you mentioned all these security improvements that you made to the shared clusters to make sure customers don't impact each other. Do you still see a technical reason, especially security-wise, to stick with dedicated clusters? Do you see a concrete advantage here, or is this more of a compliance thing?

Yeah, thank you for the question; very good question, actually. We let the tenants that originally landed on single-tenant clusters keep using those clusters, because they had some, let's say, security concerns, and they scaled so big that they simply needed the entire cluster. So I wouldn't say it's necessarily a security advantage to run on a single-tenant cluster; it was more that those tenants were already established there. I think right now single-tenant and multi-tenant provide the same security, as we put it here in the slide; the security is almost the same, it's probably just more complex to achieve in multi-tenant. All right, thank you. Thank you, too.

I think we can take one more before we finish; if not, we can take the questions offline. I've got one quick question. Yeah, sure. You were talking about micro clusters; what exactly is a micro cluster? Very good question. So in the edge we had some applications that only took up about three compute nodes; that's the "micro" part of it. It's single-tenant, but it's also very small, and the domain of the cluster is a single service, which might consist of three or four microservices, but it's basically dedicated to a single team. That's the concept of it.
So you map the cluster... a micro cluster is a small single-tenant cluster? Yes, it's mapped to a single service and to a single team. A team has full access to it, and we map one team and one service to one cluster, even if it's only three compute nodes. Yeah, thank you.

I think we are on time. Thank you for your questions; if you have any others, we can take them offline. Thank you, and have a good day.