Hello. In this session, I'll walk you through how our team designed a multi-tenant cluster architecture to host different types of workloads and applications while ensuring security across tenants and workloads. So, a little bit about me. My name is Ahmed Beboros. I'm a software engineer on Delivery Engineering at The New York Times. I'm excited to be here. As you can see, I'm a dad and a Gopher: I build a lot of Go applications and Kubernetes operators. I love to build on AWS for work and as a hobby, and I'm a scuba diver, so if you have recommendations for diving, let me know and we can talk. Before we start, let me tell you a little bit about The New York Times. Our mission is simple: we seek the truth and help people understand the world, and we do that by aiming to build the essential subscription bundle for every curious, English-speaking person who seeks to understand and engage with the world. We do this with a set of products. News and journalism is the main and most recognized product we have, but we also have Games (if you're familiar with the crossword, Spelling Bee, and others), Cooking with our amazing recipes, The Athletic, and Audio as well. So, to get on the same page, here are a couple of distinctions to explain the upcoming sections and make the picture clearer. When I refer to the platform team, I'm referring to my organization, Delivery Engineering, which builds and maintains the platform and the tooling that comes with it to help product engineering teams. When I refer to teams, that's our amazing engineering teams that build all of the products I talked about in the previous slide. So, here's our agenda. We're going to cover a few topics: why are we building an internal developer platform at The New York Times? How did we design our container runtime platform? Why did we choose Cilium as the CNI for our clusters?
We'll also talk about policies and preventing unneeded privilege escalations, about service-to-service communication and the service mesh, and about what we learned from the entire process. I'm going to start with the developer's journey. Developers have a journey just like the customer journeys we map. I had to identify the steps most developers go through when they start developing their applications: collecting requirements, ideation, planning, and all the other steps. You can see me highlighting a few of these steps here; these are the most common steps we can work around and try to unify across all of our engineering teams, so we can make the developer experience on the platform seamless and give them better tooling to help them deploy and deliver their applications faster and more easily. Let me give you an example. Here are a few colors; think of them as a color palette. The ask here, for each team, is to mix them. They're a representation of the tools we provide for teams, and imagine how you could use them to paint a picture. This is an exercise we're going to do together, so I'm going to pause for five or ten seconds and ask you to come up with something in your head about how this would work for your team. Okay, I know it's awkward, so we'll go through with it. Here we go. You can see these are different mixes of the same colors. Each person will have built a different set of tooling and processes around the same tools we provide. Give teams five or six tools, and each team will put them together in their own way, trying to get their application delivered.
Our goal here is to identify and build a unified experience for teams, to make the process faster and easier and let them adapt their workflows in a seamless way to build their applications. As I said, we need teams to deliver their applications seamlessly, so we identified this and started building the platform around it. It starts with creating and onboarding an application to the platform. That's where we come in and template all of the resources for the application, from GitHub repos and CI/CD to deployment manifests, everything they need to get their application working. Then they start developing and deploying code and logic for their application. Then we come to CI/CD, where they build, test, and deploy their application to the platform. We also provide, and this is where the main part of this talk is going, the runtime environment: the clusters and all of the configuration related to the runtime, which is built on EKS on AWS. Then we have another piece for routing, for how application traffic is routed between services, users, and other applications in the same place. We also provide a toolset for observability and a way to collect telemetry from their applications across the entire stack of the platform. That's the developer journey we're looking to create in the organization. Now that I've talked about why we're building an internal developer platform, let me talk about the runtime setup we have. We experimented with different setups for our cloud accounts in the organization to build the environment on, and as you can see here, we ended up with a multi-account architecture because it allows us to do a few things. First, group workloads with common business purposes.
It lets us have isolation between specific business units and between development, production, and so on, but it also promotes innovation: isolation between development and production accounts has to be maintained, and the setup limits the scope of impact and gives us security guardrails, all of the things we need for developing and building the platform. It also gives us cost allocation for all of these accounts, which provides forecasting and budgeting for the entire platform. When it comes to building Kubernetes clusters, there's a dilemma, because platform teams usually go back and forth: should I build many single-tenant clusters, where each tenant gets their own cluster, or should I go with a multi-tenant cluster? On one hand, having a separate cluster for each team provides more control and isolation for that specific team and gives them all the functionality they might want. But it also comes with the cost of operational overhead: either teams have to manage all of this themselves, or, when you manage it for them, you can't pool resources across clusters. On the other hand, opting for a multi-tenant cluster can offer better resource utilization, simplified management, and more agility in how you deploy your applications and the whole tool stack, but you have the soft-tenancy problem where everything lives in one place. Ultimately, the decision between many single-tenant clusters and a multi-tenant cluster depends on various factors, and I want to walk you through how we put our factors into the design considerations.
So we started looking into the main design considerations, and we came up with, first, network isolation: we need to ensure each tenant is isolated by default from other tenants, so they can't talk to each other when it's not necessary and we can keep very strict control over traffic. Then we had to come up with a framework for role-based access control, so that when a tenant gets access to the clusters, they can't accidentally delete other people's resources or mess with them. Then operational agility, by which I mean traffic management and traffic splitting: how services work together, and how, as a developer, I can do intelligent routing between multiple versions of an application. Then policy-driven security: how we enforce a specific set of tooling and security standards in the cluster so no one accidentally does something they're not supposed to, like getting escalated access to a node or a machine they shouldn't touch. And at the end, resource management: it's important to make sure tenants manage their resources efficiently, avoid overspending, and ensure optimal performance, with workloads performing well while we optimize our clusters. After careful consideration of our design requirements, we came to the conclusion that multi-tenant clusters were the best fit for our needs, recognizing this approach could help us achieve our goals while also minimizing operational overhead for our teams. To support this approach, we created a runtime environment that can be distributed across multiple regions to ensure failover and disaster recovery. Additionally, we connected these clusters to team cloud accounts, allowing teams to expand beyond just compute to other resources as well.
It's important to note that there's no one-size-fits-all: this fit our use case and our design requirements, but it could be a poor fit for another organization. From here, this is basically a diagram of the setup on AWS EKS specifically. It shows that we chose two regions, and you can see that in these two regions we have multiple availability zones per region. That's where we get high availability, and all clusters are connected together; I'll come back to this when we talk about the service mesh. You can also see this account, in a specific shared environment, is connected to other accounts as well, so tenants can expand and call other resources in their own accounts, and we don't have to manage those for them. And we have multiple environments; this is replicated across development, staging, production, and other environments as needed, and my team manages the entire stack on EKS. So, if we want to ensure network isolation and multi-tenancy in our Kubernetes clusters, we need to carefully select network components that allow us to do that. That's where we started looking into different CNIs as options, and Cilium was the one we decided on. Before I talk about Cilium specifically: Cilium is a CNI, so what is a CNI? A container network interface is a framework that dynamically configures networking resources. It provides everything between Kubernetes and the pod itself to do IPAM and allocate the resources needed when you try to assign an IP to a specific pod. There are different methods for how this can be done: it can be done in overlay mode, or in various other modes. So why Cilium specifically?
If you're familiar with iptables, most CNIs use iptables for routing. Cilium abandons that approach and uses eBPF instead, shifting routing and filtering tasks into the kernel space, which shows impressive performance gains, as promised. You can see there's a link down there to benchmarks that have been done by third parties. Another consideration for network isolation: we need to ensure that, by default, a service can't talk to another service. We have a couple of things we can use in the Cilium space, which are extensions of the Kubernetes API, like the CiliumNetworkPolicy. First, that's where we can say this namespace can't talk to any other namespace; that's how we isolate namespaces. It provides different types of policies, with L3, L4, and L7 capabilities, and DNS-based policies as well. The other thing is the policy enforcement mode. There are two modes: the default mode is a good fit for most use cases, with no initial restriction, but as soon as something is allowed, everything else is restricted. The other enforcement mode, the always mode, is helpful for environments with tighter security requirements. We also have cluster-wide network policies, where you apply a single policy across the whole cluster, independent of the namespaced network policies; that's more platform-specific. Observability is very important to us, and that's where Hubble from Cilium comes in handy and gives you a better understanding of how services communicate with each other. From a UI perspective, this is a screenshot I took of our sandbox clusters. But it's not just a UI; you can also see the rich network flows happening in Cilium.
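To make the namespace-isolation idea concrete, here's a minimal sketch of what such a policy can look like. This is an illustrative example, not our actual policy; the namespace name and the kube-dns labels are assumptions:

```yaml
# Hypothetical example: isolate one tenant namespace. Pods in "tenant-a"
# may only talk to pods in the same namespace, plus kube-dns for name
# resolution. Everything else is implicitly denied once a rule exists.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: isolate-tenant
  namespace: tenant-a
spec:
  endpointSelector: {}            # applies to every pod in the namespace
  ingress:
    - fromEndpoints:
        - {}                      # same-namespace pods only
  egress:
    - toEndpoints:
        - {}                      # same-namespace pods only
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
          rules:
            dns:
              - matchPattern: "*"
```

The empty `fromEndpoints`/`toEndpoints` selectors match all pods in the policy's own namespace, which is what makes this a namespace-isolation rule rather than a full deny.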
Then you can export them and start looking at what's really happening between each service. On the metrics side, there's a set of out-of-the-box metrics you can see here, like endpoints (how many endpoints were created in the cluster), drops, egress packets dropped and allowed, and a lot of other things. So let's talk about our Cilium setup. We're using Terraform to provision our clusters, and because we're on EKS, clusters come with the VPC CNI and kube-proxy, so we need a process to remove them. We also need to configure the CNI, and then we install Cilium via Helm. We could do that manually, but the idea in this piece of Terraform code is that we can standardize how we do this for every cluster. In this scenario, the provisioner calls a script that removes the pieces we don't want and installs Cilium. In the first portion, we remove aws-node; aws-node is the VPC CNI that comes along with EKS. Then we remove kube-proxy, because we're going to depend on Cilium for all of the traffic at this point. Then we apply the CNI configuration, and you can see there are specific tags we're applying. We'll talk in a second about the ENI mode, which tags which subnets to use and which interface to start with, then the masquerading config and other things, and in the last portion we just install Cilium. I mentioned the ENI mode, so I want to talk about how pods acquire IPs. In the AWS space, we need all pods to be routable to each other; the policy handles which pod can connect or communicate with another service in a particular scenario.
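The provisioning flow described above (strip the EKS defaults, then install Cilium with Helm) can be sketched roughly like this. This is not our exact code; the resource names, script contents, and Helm values are assumptions, and the flag names vary across Cilium versions:

```hcl
# Hypothetical sketch: bootstrap Cilium once the EKS cluster exists.
resource "null_resource" "cilium_bootstrap" {
  triggers = {
    cluster_name = aws_eks_cluster.main.name
  }

  provisioner "local-exec" {
    # 1. Remove the default EKS networking add-ons.
    # 2. Install Cilium in ENI mode as the kube-proxy replacement.
    command = <<-EOT
      aws eks update-kubeconfig --name ${aws_eks_cluster.main.name}

      kubectl -n kube-system delete daemonset aws-node --ignore-not-found
      kubectl -n kube-system delete daemonset kube-proxy --ignore-not-found

      helm repo add cilium https://helm.cilium.io/
      helm upgrade --install cilium cilium/cilium \
        --namespace kube-system \
        --set eni.enabled=true \
        --set ipam.mode=eni \
        --set routingMode=native \
        --set kubeProxyReplacement=true
    EOT
  }
}
```

`ipam.mode=eni` with native routing is what makes Cilium hand out VPC-routable IPs from ENIs instead of an overlay, matching the ENI mode discussed next.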
In this use case, we're talking about two private subnets. One is for the nodes, which you can see have IPs here in the 10.x space, with an ENI attached to the instance itself that can provision IPs in that space. Then we have another subnet, mainly for pods, where nodes have multiple interfaces, and you can see these interfaces acquire IPs from a completely different space, the 100.x space. Some have a single IP, but others have prefixes, and with prefix delegation we can scale up how many pods we can run on a node, because there's a limit on how many IPs we can attach to a single interface in AWS and prefixes let us go well beyond plain secondary IPs. Everything is automated. From a tenant perspective, we created an operator that onboards all accounts and tenants into our clusters. The process is: once an account is created, it fires an event that creates the proper CRD, and the operator starts spinning up all of the resources needed. We start with network isolation. We know your account has specific CIDRs for its VPC, so that's where we say: you can now talk to that specific set only, and not to the others, because we have a private IP space. But another thing we need to allow is that you can also talk to the internet. How do we do that? In Cilium, we can say you can talk to other CIDR sets, specifically 0.0.0.0/0, which means you're allowed to talk to everyone in that space unless it's disallowed, and then we restrict a specific set of IPs. You can see here that we're restricting the instance metadata endpoint, so you can't talk to the instance metadata on the node, because we don't want you to do that and acquire information you're not supposed to have.
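The internet-egress rule with carve-outs can be sketched like this. Again, this is an illustrative example; the namespace and the private range are placeholders, not our real values:

```yaml
# Hypothetical example: allow egress to the internet while carving out
# the EC2 instance metadata endpoint and the internal private space.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-internet-egress
  namespace: tenant-a
spec:
  endpointSelector: {}
  egress:
    - toCIDRSet:
        - cidr: 0.0.0.0/0
          except:
            - 169.254.169.254/32   # EC2 instance metadata
            - 10.0.0.0/8           # organization-wide private space
```

The `except` list is what lets a broad 0.0.0.0/0 allow coexist with a hard block on the metadata endpoint and the rest of the private address space.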
We also don't let you talk to the entire private IP set we have in the organization. So, I've talked about network isolation; now let's shift gears toward RBAC and how we provide tenants with control over the clusters. Quickly, the Kubernetes RBAC mechanism is simple, and you're probably familiar with it already: you have a namespace, there are role bindings and roles, and you can have cluster roles attached to a role binding that give a user access to a specific set of namespaces or specific resources and the actions they can take inside the cluster. And this is a simple tenant RBAC example that we have. We're talking about specific API groups and specific verbs (get, list, watch, update, delete) on resources like config maps and endpoints. We're using a cluster role binding pattern here, so we have a single cluster role that applies across the entire cluster, and for each tenant we give them a role binding into the namespace that's automatically created for them. So they can now manage all the resources they need inside the cluster. But here's the problem: verbs and resources are not enough, because you can basically do a lot of things, and there are scenarios we can't express. For example, how do we ensure all pods have limits and requests in their manifests? Or, if we give tenants access to deploy services, how do we tell them: you can deploy all the services you need except load-balancer ones? Security policies alone are also not enough. So let me go back to the Kubernetes API flow to get to how we can implement this beyond RBAC. We're talking about the API handler: basically, a request comes in, and there are different components in Kubernetes that handle it, authentication and authorization.
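The tenant RBAC pattern from a moment ago, one shared cluster role bound per tenant into their namespace, looks roughly like this. The names, the resource list, and the group subject are illustrative, not our exact manifests:

```yaml
# Hypothetical sketch: one shared ClusterRole, bound per tenant
# into their automatically created namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: tenant-base
rules:
  - apiGroups: ["", "apps"]
    resources: ["configmaps", "endpoints", "pods", "deployments"]
    verbs: ["get", "list", "watch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-access
  namespace: tenant-a               # the tenant's own namespace only
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tenant-base
subjects:
  - kind: Group
    name: tenant-a-engineers        # mapped from SSO / IAM
    apiGroup: rbac.authorization.k8s.io
```

Because the ClusterRole is referenced from a namespaced RoleBinding, its permissions only apply inside that namespace, which is what keeps tenants from touching each other's resources.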
Authentication is one of the modules that already comes with Kubernetes; it analyzes a request and checks what credentials you have to allow it. Authorization checks whether you're allowed to do the actions specified in the request. Then there's admission control: mutating admission can mutate your request, so it can add specific things to it, then validation runs, and at the end the request is written to the persistent data store. So we can see where in the Kubernetes API flow we can hook in. What we can do is something simple: we can start writing validating webhooks. Okay, cool, let's build a ton of them, one for each single thing we need. We can extend the flow and say: hey, for pod creation, just call this service and make sure the pod request passes, and if we don't want to allow it, just reject it. It's simple; that's dynamic admission control. The problem is that it's not scalable. We'd start writing a lot of validation rules, and that's something we don't want to maintain long term. It can also get complex if you have multiple validating webhooks, because every single request has to pass through them. So, what's out there in the open source world that we could use? We started looking into solutions, and Gatekeeper was one of them. What's Gatekeeper? OPA Gatekeeper is a specialized implementation of OPA, the Open Policy Agent, that provides integration with Kubernetes. Basically, it comes with all the mutating and validating admission controls we need, plus a library of policies on top. The first time I heard of Rego, it was intimidating when I started looking into it, but once you get past that, things are fine.
So that's where we're using Gatekeeper: to add more guardrails on how we prevent users from doing things they're not supposed to do. A couple of things here: Gatekeeper comes with constraints and mutations you can use. A couple of examples: there's container limits, where we say containers must have limits, and there are mutations where we can say, please assign this metadata when a request comes in. There's a set of different constraints and mutations you can use, and there's a library for that; I left a link to it, the Gatekeeper OPA library. But beyond what ships with Gatekeeper, we needed a customized solution that fits our use case. One of the use cases we ran into: we spoke about EKS and multi-tenancy, and now we have service accounts using what's called IRSA, IAM Roles for Service Accounts, which services usually use to call AWS resources. So we needed an admission control to make sure you only specify IAM roles that map to your account and that you're allowed to use. That's a simple Rego policy that lets us not just do that, but also say: hey, this is the allowed pattern you can apply. So we wrote a template for that and applied it to the cluster, and the output is something like this. We rely on the operator, which knows which account you're coming from because we already have all of that information, and this is the constraint we're applying here. We're saying: for the ServiceAccount kind in the namespace that's automatically created, please only allow the specific role patterns, and you can see your account number would be listed here.
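A template-plus-constraint pair for this IRSA check could look roughly like this. This is a sketch under assumptions, not our actual policy; the kind name, the pattern, and the account ID are all placeholders:

```yaml
# Hypothetical ConstraintTemplate: restrict which IAM role ARNs a tenant
# may set in the IRSA annotation on their service accounts.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedirsaroles
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedIrsaRoles
      validation:
        openAPIV3Schema:
          type: object
          properties:
            allowedPattern:
              type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedirsaroles

        violation[{"msg": msg}] {
          input.review.object.kind == "ServiceAccount"
          arn := input.review.object.metadata.annotations["eks.amazonaws.com/role-arn"]
          not re_match(input.parameters.allowedPattern, arn)
          msg := sprintf("role %v must match %v", [arn, input.parameters.allowedPattern])
        }
---
# Per-tenant constraint, generated by the operator with the tenant's account ID.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedIrsaRoles
metadata:
  name: tenant-a-irsa
spec:
  enforcementAction: dryrun        # switch to "deny" to fail deployments
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["ServiceAccount"]
    namespaces: ["tenant-a"]
  parameters:
    allowedPattern: "^arn:aws:iam::111122223333:role/.*$"
```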
So, if you try to create a service account with an IAM role that doesn't belong to your account or isn't mapped to it, you'll get a violation out of the box, with an error saying the role must match. You can see the enforcement action here is dry run; we can switch that and make your deployment fail if we need to. So far we've talked about Cilium and network isolation, and we've talked about roles and RBAC. Now I'm going to talk about how services communicate with each other: the service mesh. What's a service mesh? In a world of direct service-to-service API calls (I won't spend too much time on the definition: request and response), we need something that abstracts away a lot of what developers usually add to their applications and moves it into a sidecar. So we have a control plane, and the control plane injects sidecars into your pods out of the box, and all of the functionality we need lives there. Why do we need one? Because things get complex as services multiply, especially in a multi-tenant architecture. We need it for a few reasons. First of all, observability: every developer will start to create their own web service and their own metrics, and we don't want that. We want unified metrics across the board, so we can say: okay, this service is causing errors for that service, or something else is happening. We also want security: we're talking about mTLS, about which services can talk to each other, and things like traffic control. We need a way to say: okay, route between this version of a service and another version, and all of that is provided out of the box with the service mesh, so we don't have to ask developers to do any of it.
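As an example of the traffic-control piece, routing between two versions of a service in Istio looks roughly like this. The service name, namespace, and weights are illustrative:

```yaml
# Hypothetical example: split traffic 90/10 between two versions.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout
  namespace: tenant-a
spec:
  host: checkout.tenant-a.svc.cluster.local
  subsets:
    - name: v1
      labels: { version: v1 }
    - name: v2
      labels: { version: v2 }
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout
  namespace: tenant-a
spec:
  hosts:
    - checkout.tenant-a.svc.cluster.local
  http:
    - route:
        - destination:
            host: checkout.tenant-a.svc.cluster.local
            subset: v1
          weight: 90
        - destination:
            host: checkout.tenant-a.svc.cluster.local
            subset: v2
          weight: 10
```

The sidecars enforce the split on every call, so the application code never has to know that two versions exist.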
At this point, we already have an ingress model in our clusters that takes traffic from north to south. By that I mean traffic coming from the internet hits Envoy; this is outside of the service mesh. We have Envoy containers on the clusters that route traffic based on a specific domain when a request comes in and match it to an upstream. So how do we integrate our ingress model with a service mesh? We created something like this, with Istio in place: these are clusters in two different regions, we have services deployed in both clusters, and with an east-west gateway between the two clusters, services can communicate with each other. Traffic comes from north to south to our Envoy interceptors and gets mapped to a service, and a service in the east-region cluster can talk to a service in the west-region cluster, all managed automatically by Istio. In a multi-tenant setup, we apply the same concept with an Istio gateway per namespace, per tenant. That's where our operator creates these CRDs for us, so tenants don't have to manage them themselves. We've already set up the load balancers and everything we need from a gateway perspective, but then we create the smaller pieces for each tenant: a gateway for tenant X, allowing traffic on HTTP and HTTPS, with the domains we assign to each tenant defined there. So, I went a bit fast, and the last slide I want to talk about is takeaways. I spent some time explaining the process we went through, and there are a lot of takeaways we could talk about. Things like: open source is great. I've already described three components we're using in our clusters that are open source.
Istio, Gatekeeper, and others. But the problem is that exposing all of that to developers gets really complex, because I'm introducing a lot of things to them. I need them to understand Kubernetes, I need them to understand Istio, I need them to understand everything that's happening just to deploy an application. What we found is that, with all of the work we've done to manage their infrastructure, we still need to give them a simple way to deploy their applications. That's when we started looking into things like the Open Application Model, where we can define one manifest that drives all of the components behind the scenes. We're still at an early stage there, trying to develop more concepts to make it abstract from that perspective. And that's about it. Thank you. If you have any questions. Okay, so the question was: did we choose AWS before we started the project, or after? We use different cloud providers as well, but for this specific project we decided to work on AWS because of how other things are deployed in our cloud specifically. So EKS seemed to be a fit, but all I'm describing here is just the platform; we could take the same platform and run it somewhere else. Yeah, the only part that might be specific is how our cloud accounts are set up, but other than that we're using Kubernetes, which fits any other cloud provider, and most of the tooling here is open source, so we could deploy it to another cloud provider if we needed to. Okay, so the question was more: are we looking only at isolation, or what other metrics are we thinking about, like performance and latency?
If we're talking about isolation, yes, it's needed because we're in a multi-tenant architecture, but we could still build smaller single-tenant clusters in a single account, and those would still have low latency. The point is how to handle all of that when we scale. We're not running a single small cluster of 40 or 50 nodes; we're running big clusters that have hundreds, and potentially going up to thousands, of nodes. So we're also looking at latency and scale and how we manage all of that for the entire organization. So, the question was: we're using Cilium, but can we get the same features, or some of the benefits Cilium provides, with what ships by default with EKS? When we started the project, the AWS CNI wasn't providing a network policy we could apply at the namespace level. I believe they've started implementing that at this point, so you can use it. But if we're talking about scaling, performance, and observability, we needed Cilium for all of those components, and benchmarks showed iptables performing a bit worse, so that's why we chose Cilium. There are other CNIs you can use, though, especially the VPC CNI or Calico or others that use similar approaches, maybe eBPF, maybe not, and they can provide some other capabilities as well. Sure. Okay, so the question was: is this a multi-region deployment, and how does it scale? Yes, it is a multi-region deployment, and it really depends on you as a developer. The infrastructure resides in two regions geographically, split into multiple availability zones.
If we lose one, we have others, and everything has the same scaling mechanism. Everything is geographically split, but we can scale each cluster separately if we need to, and because we also split things between accounts and between environments, we treat production differently from dev, but each of them can scale up and down based on what your application needs. You can deploy an application that requires 100 pods or 100 nodes, and you can deploy applications that just require two replicas. You can deploy to east, to west, or to both, and that's where our ingress model helps, because it sits on top of the two regions, so you can deploy your application, route traffic from a single domain name, and have a failover mechanism that applies based on where the request is coming from, load balancing, and other things. Can you repeat the question? So the question is: is it the same for Istio? Yes, it is. There's north-south, where traffic comes in, but there's also east-west, where other load balancers sit between the clusters, so any pod or application in the east region can talk to another application in the west region, and this comes connected automatically; they're discoverable. For example, I was looking at it recently: if you get the endpoints of a specific application, you'll see the internal IP address of your specific pod, but you'll also see more IP addresses for the load balancers that traffic gets exposed through. The next question was: what is the typical type of traffic we run between east and west?
It really depends on the nature of the application. If you deploy your application to one region, basically no traffic runs east-west, but there are use cases where we deploy applications multi-region, and that's where one service needs to talk to another service that isn't available in east for a specific use case, so it goes to west, or vice versa. We have an application that's deployed to east and west, and another application only deployed to east, so traffic comes from west to east to get all the API calls from that service. It's anything from data, like things related to news, to user information services, things like that. There are some applications that require high availability because they're on a critical path, and those need to be deployed multi-region to ensure that if we have a failure in one region, we have another region to fail over to. But not every application needs that; for some applications we don't want to get into a complex situation. We've seen different use cases, starting from an application that lives in one region with one replica and no data stores, up to applications with hundreds of replicas, databases replicated across regions, hot caches, and everything necessary so that if one region fails, we can serve traffic from the other region; that's active-active. We also have active-passive situations, where most of the traffic goes to the east region and everything is served from there, but we keep another region on standby, so if something happens to east, we automatically fail over to that region. So there are different types. Okay, so the question was: if an employee leaves, how do we revoke their access? What we do here is basically single sign-on.
Access to these clusters is tied to single sign-on, so when we deactivate your access, we can also deactivate the specific accounts. When we talk about the clusters, we basically grant you access to the clusters through your access to the account itself: you need access to your account first, and then we create all of the IAM roles and nested permissions to access that specific account and talk to the cluster in that specific namespace. It's all tied together; once you lose access to your AWS account, you naturally can't access the cluster anymore. Any more questions? Okay, thank you all, I appreciate you coming.