Hi, my name is Dave Grzanti. We're going to talk today about what we learned designing and scaling a multi-tenant developer platform. I'm a principal engineer at The New York Times, and I'm going to hand it over to Ahmed to get us started.

Hello everyone, and welcome. Let me start with our mission statement: we seek the truth and help people understand the world. We do that by aiming to build the essential subscription bundle for every English-speaking, curious person who seeks to understand and engage with the world. At The New York Times we serve different types of products. News and journalism is our most recognizable product, but we also have Games, like the Crossword, Spelling Bee, and other games you might be familiar with. We have Cooking, if you're looking for some amazing recipes with Thanksgiving coming along, plus Wirecutter, Audio, and The Athletic for sports.

I'll start with a couple of terms to get us on the same page. When I talk about the platform team, I'm really focusing on the Delivery Engineering organization. That's our team: we build a platform to help the other engineering teams across the organization. When I talk about teams, I mean the product engineering organization, the teams building all of the product features I just mentioned.

There are a few topics I'm going to go through: why we are building an internal developer platform at The New York Times, and a little bit about the runtime architecture. Then I'll hand it over to Dave to talk about application templating and continuous delivery, and finally we'll tell you what we learned through this journey.
So let's start with the developer journey. Developers have a journey, much like a customer journey: it starts when they receive their business requirements and ends when they deliver their application. We map all of the phases they go through, from planning and design to creation and monitoring. A few of these steps, the ones highlighted here, are common to most of the applications we see, and those are the places where our platform can come in to standardize and deliver a consistent experience to all the engineering teams across the organization.

But innovation has a cost. When we give engineers all of this tooling, they build amazing and innovative products, but with different techniques, different tools, and different standards. Imagine we hand out the same color palettes and ask every team to paint with them: you'd probably see something like this. The same tools, but each team creates its own standards and its own way of building services and putting code together. Our goal here is not to limit innovation; it's to deliver a standard experience and help these tools get adopted faster.

So how do we make that happen, and what are the commonalities between all of the steps I talked about earlier? We start with the workflows users go through, from the slide I mentioned. It begins with creating an application, which templates all of the resources the team needs to get started. Then they have source control for their application, where everything comes through. We provide them with CI/CD, which covers all of the build, test, and deploy steps. Then there's the runtime, which we'll dive deeper into: that's where Kubernetes clusters and the underlying cloud resources come in. Last is ingress, where all of the traffic from the customer side arrives. And we back all of this with an observability layer that observes the entire process for our teams.

Now let's dive deeper into how we do the runtime architecture. We experimented with different setups for our cloud accounts, between a single account, multiple accounts, and other variations, and we found that a multi-account architecture is the best fit for us. It groups workloads with a common business purpose into distinct accounts, avoiding dependencies and conflicts, and it lets us apply distinct security controls between development, production, and so on. While we get separation, we can also give teams more freedom to innovate: they have their own dev accounts where they can start building new products. It also limits the scope of impact between accounts, because teams are isolated in their own accounts. So now we have a management account, and every team gets their own account; that's how we centralize the runtime architecture across the organization.

When you try to build a centralized runtime architecture, you run into a dilemma: do we run multiple single-tenant clusters, where we operate all the clusters and team A and team B each get a dedicated Kubernetes cluster that we manage for them, or do we go for a single cluster across the entire organization and do multi-tenancy? One can fit some use cases and the other fits others. We started looking into why you would do each and how to settle on the things we actually need. It's not one-size-fits-all: it depends on the organization and on how open the team is to maintaining and orchestrating all of the work involved.

To explain why we decided what we decided, we have to go through the requirements we had to decide on. Let's start with network isolation: whether we need network isolation inside the clusters if we go multi-tenant. By default, a single-tenant cluster is already isolated from the others; you get one cluster and no access to the rest. With multi-tenancy, we have to ensure that each namespace, each tenant, is isolated from the other tenants, so that if someone manages to attack one tenant they can't escape to the others. Next is role-based access control: how we make sure every tenant gets the right scope for their access and their namespaces. Then operational agility: how we make sure the entire process is automated so we can move faster when onboarding tenants to our clusters. Another aspect we looked into is cross-tenant IAM security. The entire platform is built on top of EKS, for example, so in a multi-tenant cluster we need to make sure I can't assume a role belonging to another tenant who has already set up their account or services. And finally resource management: how we ensure you're using the right resources for your application and you're not a noisy neighbor for others.

Combining the multi-account architecture with a multi-tenant cluster architecture, we found a good fit for our runtime design, and we came to the conclusion that multi-tenant clusters are the best fit for our needs. We recognized that this approach helps us achieve our goals and minimizes our operational overhead. To support it, we created a runtime environment that can be distributed across multiple regions to ensure failover. As you can see here, we have multi-region clusters across different environments, and each team account gets access to these clusters by default when onboarded to the platform. As I mentioned earlier, it's important to understand that no one size fits all: this solution and these design considerations fit our use cases and the things we need.

So we've talked about multi-tenant clusters and multiple accounts, but there's one more aspect: how do we make the onboarding process easier? We now have hundreds of teams that we need to keep onboarding automatically. So we started to think: can we do it in a GitOps mode? Can we make it self-service? How do we make sure these teams are onboarded to the clusters without too much manual intervention? So we came to the question: can we automate that?
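To make the idea concrete, here is a minimal sketch of the kind of per-account object such GitOps onboarding can produce. The `Tenant` kind, API group, and every field below are illustrative, not the Times' actual schema:

```yaml
# Hypothetical Tenant custom resource, created automatically when a
# new cloud-account event is received; an operator reconciles it into
# namespaces, network policies, and IAM constraints for that tenant.
apiVersion: platform.example.com/v1alpha1
kind: Tenant
metadata:
  name: games-team
spec:
  # AWS account this tenant maps to (illustrative ID)
  accountId: "111122223333"
  # Namespaces the operator should create and isolate
  namespaces:
    - games-dev
    - games-prd
  # Environments and regions the tenant is onboarded to
  environments:
    - name: dev
      regions: [us-east-1, us-west-2]
    - name: prd
      regions: [us-east-1, us-west-2]
```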
Yes, we can. But where is the glue that automates tenant onboarding against the actual accounts? We built it around operators; if you're familiar with Kubernetes operators, an operator just reconciles something, continuously operating on it, so it keeps iterating until every onboarded tenant gets what it's supposed to get.

What happens is this: once your cloud account is created, we listen for that event and, still following a GitOps approach, we transform the event into a custom resource, as you can see here. The custom resource is the reference for a tenant; it says this tenant belongs to that account. When that happens, the operator behind the scenes builds the different things that let us achieve our design considerations.

Starting with tenant onboarding and network isolation: here we're using Cilium. Cilium gives us the capability to automatically isolate the namespaces created for a tenant. Once we onboard a tenant, we get all of the network isolation: there are Cilium policies specific to that tenant and cluster-wide policies that apply to all tenants.

The other thing, which I mentioned earlier, is that we're using EKS. If you're familiar with IRSA, IAM Roles for Service Accounts, that's how services actually consume AWS resources. One issue we found is that in a multi-tenant cluster everyone uses the same OIDC provider and the same trust policy, so we have to make sure an IAM role belonging to one tenant can't simply be assumed by another tenant. That's another part of the operator: it sets a constraint per tenant, per namespace, which ensures you can only assume the roles related to your own account, based on the specification of your Tenant custom resource.

The last thing I want to go through, without covering the entire ingress model, is how your traffic comes through the cluster. To describe it really quickly: our ingress model is based on Envoy; that's how all the traffic comes in, and then we forward it upstream. We have a service mesh built on Istio, running multi-region, and all services can communicate with each other. How do we tie this into tenant onboarding? The same multi-tenant setup, the same operator, the same custom resource: a single Tenant resource that, once onboarded to the cluster, templates out all of the resources Istio needs, the gateway, the certificate, everything required at that point to make the tenant viable. From there, teams can just keep deploying applications. At this point all of our runtime is set up, and tenants can start consuming Kubernetes resources and move on to the next steps: templating their applications and setting up their builds and pipelines. And here I'm going to pass it to Dave, who will walk us through that.

Hey everyone. Next we're going to talk about two sections, stepping back a stage to what developers actually interact with. As Ahmed described, we designed the runtime concepts first, multi-tenancy and multi-account, and then we had the challenge of making this easy for developers to use: how do they onboard? We set ourselves the goal of letting developers build fully functional services running in production on day zero, in under ten minutes. That's an ambitious goal; I think we're most of the way there. A couple of things take longer than that and still involve manual approval, but the idea is present and we're working toward that time.

Generally, we built a template engine that includes a bunch of capabilities for developers, so they only have to plug in their own logic, the user-provided piece on the right. I'll go through these in more detail in a second, but the idea is to give them everything they need to use the runtime architecture, the ingress layer, and their multi-tenant AWS account, without providing much more than a few details; then all they're doing is working on app logic. All of this is stored in GitHub, maintained in GitHub repos, and we follow GitOps principles for all the pieces. There are a couple of external systems, which we'll show in a second, but generally everything is stored as source code of some type.

The first two items are source control setup and the code starter kit. You can pick from a few different languages to start: we'll give you a Go app, or you can choose a Docker template where you provide your own code and your own Dockerfile and we build it for you. The observability tooling is based on OpenTelemetry; the Go starter project includes metrics and traces out of the box and ships them off to the observability tooling. The OpenTelemetry libraries are a bit easier in some languages than others, but we're working toward supporting the four or five that are our core languages.

Next is secrets integration with Vault. I don't know how people feel about Vault, but it tends to be complicated to get secrets in and then injected into Kubernetes in just the right way. Our goal is that developers should only have to say what their secret is, as a key and value, and we take care of the rest: putting it in the right spot and setting up the Kubernetes template so that it gets pulled in for them.

Containerization is next. As I mentioned, the starter project comes with a Dockerfile, so you don't have to do anything. If you want to use your own code, in a language we don't yet support, all you have to do is change the Dockerfile; we'll build it, push it to the registry for you, and get it deployed.

The next two things are a combo: the build-and-test pipeline and the deployment pipelines. We use a combination of CI tooling and Argo CD, if folks are familiar with that, to do deployments to Kubernetes. Drone is responsible for most of the build and test jobs, and Argo is responsible for all of the infra deployments to Kubernetes. One of the other things the application template does is set up all the necessary Argo pieces for you: the AppProject and ApplicationSets needed to manage multi-tenant permissions, matching what Ahmed talked about before, so that people aren't crossing namespaces or crossing between tenants.

The last thing, which might be the most complicated, at least for users who aren't comfortable with Kubernetes, is all of the YAML they need to get their app running, so that they aren't writing it themselves. We're using a few different tools here. Most of what we do now is built on Kustomize, and we're exploring where Helm can come in to abstract some of that away from the user, so we're not dumping a bunch of Kustomize files into their repositories that they have to manage later. In general, if you just want to get a Go app or a Go API up and running on the platform, and you don't care about anything complex within Kubernetes, you shouldn't have to touch anything we give you other than the name of your app and your target tenant; you should be up and running in a few minutes.

So what does that look like from a workflow perspective?
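Before moving on, to make the Kustomize point concrete: a templated app repo might carry a generated entry point along these lines. The names, namespace, and registry below are made up for the example, not our actual output:

```yaml
# kustomization.yaml - hypothetical generated overlay for a Go service.
# The platform would template this; the team mostly touches app code
# and the image tag, not the Kubernetes manifests themselves.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: games-dev          # tenant-scoped namespace (illustrative)
resources:
  - deployment.yaml
  - service.yaml
images:
  - name: app
    newName: registry.example.com/games/my-go-api
    newTag: "1.0.0"
commonLabels:
  app.kubernetes.io/name: my-go-api
```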
As a team or a user, your first step is to fill out a form, giving us some information: what GitHub team you're from, what tenant you're targeting, the name of your app, what language you're using, and whether you want to try some experimental features we're working on. You click submit, and all your responses get stored as a JSON file inside a GitHub repo. That kicks off a Drone job, which spits out a few different GitHub repos representing the code involved and the actual secrets involved, and then templates out the Argo projects that handle syncing all of your deployments to Kubernetes. Assuming all your PRs get merged, your app is up and running in our runtime cluster. What the user sees is just this simple form, and then we tell them: here's all your stuff, stored in GitHub if you want to understand it. But the idea is that all they really interact with is that form interface.

So let's talk a little about continuous delivery from a multi-tenancy perspective. I'm going to talk about Argo specifically; I won't go into a ton of detail, but since that's where most of the multi-tenancy work came in, I want to focus on it. Our idea was to follow GitOps principles for delivery to Kubernetes, with a centralized CD platform that would pave the way for more advanced capabilities down the line. What we found with teams using Drone, and I think this happens with a lot of common YAML interfaces in CI tools, is that when people wanted to do more complex things, like canary or blue-green or anything beyond "deploy this YAML," they would write it themselves. There was just enough variation across teams that maintaining it got very complicated, and it was hard to standardize a common way of doing any one of those things. Argo made that a lot easier, because it abstracts a lot of the complexity away.

So our goal with providing this tool was to pull people away from writing a lot of the YAML themselves. We wanted to offer predefined workloads, templates, and best practices, and adopt common workflows aligned with the dev lifecycle they already had.

For the Argo setup, we went with one Argo to rule them all. There are a couple of different ways you can run Argo, similar to the different ways you can manage Kubernetes clusters: you can install Argo alongside each cluster if you're running one cluster per team, or even one Argo for dev and another for stage and prod, depending on your environments. We're doing the single-control-plane version, where one Argo deploys to all of our environments. That simplifies things for users, because when they log in they can see their apps running in dev, stage, and prod. We also use the Argo CD pull request generator, so people get previews of pull requests as soon as they open them; one Argo handles all of that, which gives them a much simpler interface.

There were some security controls we had to balance as we moved forward with Argo. When we initially installed it, our security team came back and said: based on the way the default Argo installation was done, I was able to get in and do all these things, with the roles Argo set up, that I shouldn't be able to do. So they gave us some rules to follow. We don't want Argo to have cluster-wide admin access, which it wants by default. We want its permissions limited to the namespace level in the target clusters, the clusters customers deploy to. And it gets no access to install custom CRDs: for us, the team running Argo, versus the team running the runtime environment, the rule was: don't assume that just because you're in the same org you have admin-level permissions; you need to be treated like a tenant.

Our solution was a few things. We separated the pipelines that handled the Argo workloads from the automation that handled RBAC and CRD installation in the runtime clusters; we handed the CRDs over to that team to install for us, which limited the scope of control Argo needed. We had to customize the Argo CD Helm chart a bit to separate out those CRDs and cluster roles, and only worry about installing Argo in a target namespace. And the last thing, probably the most complex, was customizing the service account Argo uses: removing its create/read/delete-star permissions, and replacing cluster roles with tenant-level roles for the Argo CD manager and the permissions it needs for installing.

This diagram explains that a little more. By default, Argo wants to be installed with one service account in the target cluster that has access to do everything: install apps for every tenant across the entire cluster, including the default namespace. We separated this out a little. We still have that service account, but it has a role with only read, basically get, on a few things: no get on secrets, just get on namespace-level objects and a few cluster-wide things. It then inherits tenant-level permissions in each of the namespaces, the same way a user would. So it does have access to read, create, and delete things in all of the customer namespaces, but it has nothing at the cluster level: it can't delete the whole cluster or do anything administrative. This took advantage of something Ahmed talked about earlier: the Kubernetes operator we wrote to provision tenants and create users was able to give Argo those necessary permissions as part of the installation process. So with the custom CRDs installed to solve the read-only role, as I mentioned on the previous slide, the operator takes care of the rest.

That wraps up the CD portion. I think we want to transition over to the general lessons we learned in each of these phases. I'll go over the first two and then hand it back to Ahmed for the last two.

One of the things we have learned, and are still learning, is that we're not a platform being sold out in the world; we're an internal platform. So documentation is tough. Not that it's tougher for us than anyone else, I think it's tough for everyone, but we don't have a team of people writing documentation, and most of the users on our platform don't understand Kubernetes. I'm not sure how many people really understand Kubernetes. It's a challenge to write documentation that's clear, concise, and consistent across all of our teams: the way I write, versus somebody on our runtime team, versus the application templating team, everybody's context is different and everybody writes differently. So we're still learning how to make this approachable for users, and we're constantly trying to get feedback.

The next thing is adoption and partnership. Oops, did I lock my screen?
I did. We've been working a lot on migration. I talked a lot about new apps, but we're also trying to get people to migrate over, and one of the ways we've done that is by partnering with teams: teaching them about what we've built and helping them use our tools. Partnering with teams and embedding people with them has been really powerful.

The other two pieces: first, a platform is a product. One of the common things I hear is, let's treat it as a project, deliver it, and move on. But if you treat it as a project, you just deliver a few features. You have to listen, and you have to iterate on top of it. The way we keep delivering is: we ship our first iteration, hear from our customers, understand what we need to do, and then iterate on the next level. For example, a couple of things we didn't have when we started were customizable Helm charts people could use; we started by dropping Kustomize YAML files directly into repos, and that's where things got complex, so we iterated on top of that.

Another element, one I like to say, is that we're building a platform, not a tool. All of the tools behind the scenes should be changeable: I can plug in a different CI model, a different CD model, or different secret-management software. All of these pieces have to keep growing based on the most important thing here: customer feedback. A really important thing to understand is that we are building a product for our engineering teams. As an engineer, I want to build the newest, shiniest thing, which makes me feel good about what I'm building, but at the end of the day, if I'm building something that isn't helping the product engineering teams deliver their code faster, something is wrong with the platform. So customer feedback is a really critical piece, and we have to listen and understand what teams really need. Maybe we're imagining a very complex solution when all they need is something simple that delivers their application without any problems, in a standard way. That's what drives it all.

So thank you all for being here. We're up next for the feedback session, and we'd love your feedback. We have another session on Thursday about how we do scaling, with our colleagues from The New York Times, and we just did another session at ArgoCon about how we do multi-tenancy on Argo. Thank you all.