We're going to get started. We are going to be talking today about scaling Argo security and multi-tenancy in AWS EKS, and we are from the New York Times. My name is Dave Grzanti, I'm a principal engineer at the Times. And I'm Luke Phillips, a staff engineer with the New York Times. At the Times, our mission is to build the essential subscription bundle for every English-speaking curious person who seeks to understand and engage with the world. We're a digital-first experience, leaning into technology to produce comprehensive news coverage. As an example, what's being shown behind me is our coverage of the Tulsa Race Massacre. This particular story used machine learning to take historical records and create the 3D spaces being mapped on screen, and it used a variety of microservices and data architecture to map our news stories to this dynamic digital presentation. This is just one example of what the Times is doing to advance our storytelling and visual capabilities and build more comprehensive products. So today we're going to go over why we're building an internal developer platform at the Times, what Luke and I specifically work on within that team, how we're doing continuous delivery with Argo, an overview of multi-tenant Argo architectures, how we're running Argo at the Times, some scaling challenges we faced, and some lessons learned. For most people, what you may be familiar with at the Times is the news coverage, and I'm sure people are familiar with Wordle as well. But there are a lot of other products within the Times' scope: Games, the crossword, Wordle, a very vibrant Cooking website, and other Times brands like Wirecutter and The Athletic. There's also a growing audio presence with podcasting and a few other things.
So there are a lot of developers and engineers at the Times building these products, and teams like ours supporting those engineers. If you were an application developer at the Times, we really want you focusing on developing software and stories like the visual one we showed you, but increasingly developers are being tasked with doing a lot more than just develop as things quote-unquote shift left. They're responsible for containerizing their applications, doing CI/CD, building and testing, dealing with ingress routing and monitoring, and their lives have just gotten a lot more complicated. So part of what we're trying to do is pull back on some of that complexity, make their lives easier, and let them get back to adding value for customers. In addition to the developers, we also have folks in more of a DevOps or SRE role who are responsible for running the Kubernetes platforms these teams deploy on now. And we see a mix of how this may have worked in the past: maybe a team used a managed cloud offering, maybe they ran their own clusters; everybody was doing it differently depending on the team. What we really want to do is consolidate all of that down to a shared platform, so future teams could deploy their applications and not worry about managing, maintaining, and updating clusters. This is really where our team, the Delivery Engineering team that Luke and I are part of, comes in. Delivery Engineering owns and manages these clusters to allow teams to deploy. It offers distinct multi-tenant spaces within the clusters, and each team operates separately from the others with RBAC and security controls.
And if we zoom out to look at the big picture of this IDP we're building: we want an NYT engineer to be able to onboard to our platform, develop within a source control system like GitHub, have CI/CD tools to build, test, and deploy to an AWS EKS environment, have centralized ingress and routing for those applications, and then be able to monitor them. This is something our larger team, Delivery Engineering, is building. Luke and I specifically land in the earlier parts of the diagram; some of our colleagues are talking about other pieces of this system, but Argo is really the thing we're here to talk about today, and why we chose Argo as our CD tool of choice. So I'm going to hand it off to Luke to talk a little bit about that.

Thank you, Dave, for the background on our developer platform. Given our platform and architecture of orchestrated container workloads on Kubernetes, the emerging best practices for continuous delivery are utilizing GitOps. After evaluating the landscape of tools available, we went with Argo CD. This helps facilitate our continuous delivery and GitOps patterns within our internal developer platform. Some of the realized benefits of an internal developer platform: with alignment across your software development lifecycle, your CI and CD processes become reusable and repeatable, with common tools and patterns. They improve the velocity, quality, and supportability of software services, which allows people like Dave and myself to scale our abilities as well. And being that this is ArgoCon, I don't really need to belabor the CD benefits much more. But any discussion about security and scaling requires a review of the deployment architectures and the choices you make around operating Argo CD itself. So we want to give a quick review of some of the published architectures of Argo CD and which was just the right size for us.
Keep in mind that each example may fit better in your own use case. Starting off with the standalone model: each cluster has its own Argo instance. Some of the benefits of this are reliability, since each cluster operates independently; isolation of concerns; better security; and distribution of load per cluster. However, this creates a lot of complexity for management and updates, as well as complexity in providing user access, since it requires maintaining many instances and duplicating configuration. So this was not quite the right size for us, but we liked a lot of the security benefits here. Looking at some of the other options, you have your hub-and-spoke model: a single Argo instance connecting and deploying to many Kubernetes clusters. This is easy to manage and creates a wonderful developer experience, with one pane of glass to see all of your deployed applications, simple disaster recovery, and simple access. The challenges, though: it is a single point of failure, scaling requires a lot more tuning of the individual components of Argo CD, and there's a lack of isolation for security. So this was getting a lot closer to what we were looking for as far as the developer experience. And finally, one other set of architectures you can consider is Argo instances per logical group, or splitting apart the components of Argo itself. Some of the benefits here: a little better load distribution per group, an outage of one cluster won't affect every group, and credentials become a little more scoped per group. However, you still have the challenge of maintaining more instances of Argo, and it requires a separate management cluster. Alternatively, you can split apart the components of Argo CD; there's an experimental open source project, Open Cluster Management, that does this.
I'll also give credit, for some of this review and for other architectural solutions that provide better support around splitting components, to the Akuity and Codefresh products, which are worth checking out. However, for us, we were looking at just the open source solutions right now. So, balancing the various pros and cons of the architectures we were looking at, we really enjoyed the experience of one Argo to rule them all. We also have some smaller instances of Argo for our own testing, which allows us to dynamically balance clusters if we need to, but right now it's one Argo to rule them all. We also balance the trade-offs of our strict security controls with CI governance and policies around our GitOps repos, and a measured separation of concerns for the various repos of config. I'd also advise you to check out another talk later this week by some of our colleagues that will go into greater detail about the Git security and policies we use there. And so from here, I'll transition back to Dave, who will go into more detail about how we're using Argo CD in this architecture.

Thanks, Luke. So I'm going to jump into Argo at the Times, specifically how we have it deployed and some challenges we faced fitting into our multi-tenant Kubernetes environment. As Luke pointed out, this is the model that we chose from the available architectures we surveyed. And like I mentioned, we're deploying this into a multi-tenant cluster that another team within our department runs, which had some requirements for how Argo operated; our security team also wanted to see it match the RBAC controls that were already there. So when we set out to use Argo CD, like a lot of people, we went to the docs and used kubectl to apply the manifests and create namespaces, all that sort of thing. When we wanted to add a new target cluster, we used the Argo CD CLI to do that.
This, from our experience, assumes a certain set of control and permissions within the clusters you're installing it in, mostly cluster-wide admin, which we didn't have. We were operating just like a tenant, with limited permissions within this multi-tenant cluster. So when our security team looked at the way we had installed Argo in our sandbox POC environment, they raised some red flags and said: you can't install this in the multi-tenant cluster that the tenants are using this way. Here are our requirements for you. You won't get any cluster-wide admin access for the Argo installation. You need to limit what permissions Argo is going to have in the target clusters where users are running their applications down to essentially the same permissions the users would have. And you have no access to install custom CRDs, so we're not going to give you access to install the Argo CRDs. From our side, we were contemplating how we would do that: what we had to split up, what we had to do ourselves, and how to work with the other team in our department to manage all this. What we came up with was these three things. The first one was bring your own RBAC. We separated the installation of the CRDs, and anything else that would require cluster admin, into the same workflow that was being used to install and set up the Kubernetes cluster in EKS. This allowed us to keep all the CRDs and related pieces separate. We customized the Argo CD Helm chart to also separate things out: we pulled out the pieces that could be run without admin access, essentially just with a bash script that carves out the pieces we needed to separate and uses only the parts we needed.
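As a rough sketch of that split (assuming the community Argo CD Helm chart; value names vary between chart versions, so treat these as illustrative), the tenant-installable portion can be rendered with cluster-scoped resources turned off, while the CRDs get applied by the cluster-provisioning workflow that does have admin access:

```yaml
# Hypothetical values.yaml excerpt for the community argo-cd Helm chart.
# CRDs and cluster-scoped RBAC are installed separately by the workflow
# that provisions the EKS cluster, which runs with cluster-admin.
crds:
  install: false           # skip Application/AppProject CRDs in this release
createClusterRoles: false  # skip ClusterRole/ClusterRoleBinding objects
```

With those values set, `helm template` produces only namespace-scoped resources, which a tenant-level identity can apply.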
And then the last piece, which is the more interesting thing that I'm going to talk a little bit more about next, was how the role of the service account that lives in the target clusters functioned and what permissions it had. The traditional way it's done, if you just follow the docs, is that it has create, read, update, and delete on everything in all namespaces. We made it so that the cluster-wide role only had essentially read permissions, and then we attached namespace-specific roles that matched what the tenants would have. So it has one service account with a cluster role, and then a role per namespace. Let's dig in a little bit on what that looks like and how it works. This is a simplified diagram showing that Argo CD lives in its own admin cluster, and this operates on all of our environments: dev, stage, prod. It looks at a set of Git repos for the project and app files, and on the target clusters there's that service account I mentioned. It has the cluster role that essentially only has read, and in each namespace it has a role binding that essentially mimics what each tenant is given. So Argo has the same permissions a tenant would have over all of the applications within that namespace. This worked out better for us than I think we expected in the beginning; we were worried we weren't going to be able to match the security requirements, but we were able to achieve that. We didn't necessarily love this setup, though, because the more research we did, the more we realized we wanted to lean a little bit more into AWS IAM-native security versus using a traditional Kubernetes model with a service account and a token. So the next thing we looked at was: could we lean into the AWS IAM model a bit more?
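A minimal sketch of that RBAC shape (names like `argocd-manager` and `tenant-a` are placeholders, not our actual config): one read-only cluster role bound to the service account, plus a per-tenant-namespace role binding that mirrors what the tenant itself is given.

```yaml
# Cluster-wide: read-only visibility so Argo CD can watch resource state.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: argocd-manager-readonly
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argocd-manager-readonly
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: argocd-manager-readonly
subjects:
  - kind: ServiceAccount
    name: argocd-manager
    namespace: kube-system
---
# Per tenant namespace: write access matching the tenant's own permissions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argocd-manager
  namespace: tenant-a
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argocd-manager
  namespace: tenant-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argocd-manager
subjects:
  - kind: ServiceAccount
    name: argocd-manager
    namespace: kube-system
```

The Role and RoleBinding pair is stamped out once per tenant namespace, so adding a tenant never widens the cluster-wide grant.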
And after reading through various AWS docs about identity mappings, and blogs on Argo and how the two could match, this is the model that we came up with and wanted to try to achieve. It would translate the same cluster and namespace roles that I talked about, but it would remove the need for the service account and the Kubernetes token. So let's go through how we did this. First, a quick primer on AWS IAM for people who aren't familiar: it's the identity and access management component within AWS, and it controls who can access what within AWS. So it's their layer of IAM, separate from Kubernetes RBAC. The first thing we needed to do was set up a role, which we called argocd, and this is what the Argo components within the Argo CD cluster will assume when they need to talk to another cluster. That role has two things set up. The first is a trust relationship, which allows the Argo CD cluster roles to assume this role. The next piece is a patch within Kubernetes to let those service accounts assume the role via an annotation. Then there's the IAM policy on that role, and I'll get to why this is important in a second, but this allows the role to assume the role in the target cluster from the AWS side. If you think about it as a big picture: we set up an IAM role with a trust relationship and a policy.
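In rough JSON terms (account IDs, the OIDC provider ID, and role names below are all placeholders, not our real values), the `argocd` role's trust relationship lets the Argo CD service accounts in the admin cluster assume it through the EKS OIDC provider:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowArgoCDServiceAccounts",
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111111111111:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:sub": [
            "system:serviceaccount:argocd:argocd-application-controller",
            "system:serviceaccount:argocd:argocd-server"
          ]
        }
      }
    }
  ]
}
```

The Kubernetes-side patch is the usual IRSA annotation on those service accounts (`eks.amazonaws.com/role-arn: arn:aws:iam::111111111111:role/argocd`), and the one policy attached to this role is simply an `Allow` of `sts:AssumeRole` on the per-target-cluster role (a hypothetical `arn:aws:iam::222222222222:role/argocd-deployer` in this sketch).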
This is something typical you would do if you're creating IAM roles in AWS, but I was just stepping through the three pieces needed from the Argo CD operation side. The next thing is creating the role that this argocd role will assume in the target cluster, and this can be done across accounts, so I've shown how that can be separated. The important thing is the IAM principal, showing that the trust relationship allows the argocd role to assume it. This role won't have any specific IAM permissions to do anything in AWS; it'll just have a trust relationship which allows that argocd role to assume it. This is what the cluster role and role bindings I've been mentioning all along look like, specifically. The one on the left is the tenant-level role binding and role that live in the tenant-specific namespaces, and the one on the right is similar to what you would see if you install Argo CD out of the box, except the cluster role only has list and watch on the cluster: no create, update, or delete permissions. Then, to complete the circle and link the IAM role down to the native Kubernetes RBAC, AWS has this concept called an identity mapping: you give it the AWS identity and the Kubernetes identity, and that's how it does the linking. And lastly, when you register the target cluster within Argo CD, you just have to do some specific things within the configuration to say: this is the role I want you to assume when you're talking to the target cluster, here's the name, and here's the certificate data. This would look a little different if you were just connecting the cluster with Kubernetes tokens. So with that, I'm going to hand it back to Luke for a second to talk about some of our scaling challenges.

Thank you, Dave. We'll take a quick look at the Argo components themselves.
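Those last two steps can be sketched like so (cluster names, ARNs, and the `argocd-manager` username are placeholders): the identity mapping ties the IAM role to the Kubernetes identity that the RBAC bindings reference, and the declarative cluster Secret tells Argo CD to assume that role when talking to the target cluster via its `awsAuthConfig`.

```yaml
# Identity mapping on the target cluster, e.g. via eksctl:
#   eksctl create iamidentitymapping --cluster target-cluster \
#     --arn arn:aws:iam::222222222222:role/argocd-deployer \
#     --username argocd-manager
#
# Cluster registration in Argo CD as a declarative Secret:
apiVersion: v1
kind: Secret
metadata:
  name: target-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: target-cluster
  server: https://EXAMPLE.gr7.us-east-1.eks.amazonaws.com
  config: |
    {
      "awsAuthConfig": {
        "clusterName": "target-cluster",
        "roleARN": "arn:aws:iam::222222222222:role/argocd-deployer"
      },
      "tlsClientConfig": {
        "caData": "<base64 cluster certificate data>"
      }
    }
```

With `awsAuthConfig` set, Argo CD authenticates with short-lived AWS credentials rather than a long-lived bearer token, which is the service-account piece this model removes.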
Given the style of architecture we've chosen, what kind of scaling challenges are we going to run into, and what do we want to make sure everyone is aware of with Argo? Think of the Argo CD components as operating in three phases: visualize, apply, and retrieve. Each phase presents scaling challenges to be aware of and tuned for. Our own challenges exist in the many cluster connections and the many GitHub repos we're connected to. As an option to extend what Dave mentioned earlier, we could even go so far as to have one cluster connection per tenant per cluster; that's a challenge for future us to be aware of. So these are a collection of all the challenges you might face given the styles of architecture we were looking at. For the many Git repos, or monorepos, you have a variety of parallelism and cache-processing concerns on the repo server side. For the many deployed applications or many cluster connections, you have similar challenges, likewise with the large amount of data being retrieved from the API, where the body of the API response itself can grow. And to solve for these challenges: first, we want to make sure we thank the Argo community and the Helm chart project. Be sure to check it out; they have already exposed a lot of the out-of-the-box flags and values ready to be set in the chart, so you have some tuning already available to you. I'm not getting into too much detail here, as many of the other talks you might have heard today, or will still see today, will get into further detail. But for our specific challenges with the many Git repos: be aware of all the timeouts and your polling, processing, forking, and memory usage. As we mentioned earlier, part of the monitoring of our platform is to be constantly aware of how Argo is growing and what kind of memory tuning you have to do across your infrastructure.
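As an illustration of the kinds of knobs in play (these are parameters exposed by the upstream project and its Helm chart; the exact key names vary by version, and the numbers here are hypothetical, not our settings):

```yaml
# Hypothetical Helm values excerpt for tuning Argo CD under load.
configs:
  params:
    controller.status.processors: 50          # app reconciliation workers
    controller.operation.processors: 25       # sync operation workers
    controller.repo.server.timeout.seconds: 120  # tolerate slow manifest generation
    reposerver.parallelism.limit: 10          # cap concurrent manifest generations
repoServer:
  replicas: 2                                 # spread Git/manifest load
```

Replica counts, processor counts, and memory requests are the settings we revisit most as the number of repos and applications grows.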
Likewise, for the many cluster connections you have a lot of tuning around the processor queues, and you can start to look at sharding the cluster connections across replicas of the application controller; and finally, for the API itself, really just watch its memory usage. And so finally, we'll discuss some of the lessons learned through all this work that we've done. As we continue to investigate how we tune our security and multi-tenancy, we want to keep leaning into the platforms we're adopting, whether it's AWS or Argo CD. We're certainly interested in having this conversation with all of you: what ideas do you have for AWS IAM roles or Argo CD security and multi-tenancy, and where do we think the projects need to go? We certainly have a challenge, for example, of our Argo projects replicating the same security controls that we're managing in AWS IAM. Is there any alignment we could get there in some future capabilities or tools?

Yeah, and then the next slide is just one thing we've been playing around with in this AWS IAM model. The concern I think we have is that if someone gets access to the service account token that we have now, they essentially have access to every app deployed in Argo CD, because we have this one Argo CD to rule them all, which is not great. Leaning into the AWS IAM model is better, since it's using more native concepts, but still: if you get access to Argo in our ops cluster, you have access to everything. So we've been thinking, is it better to do roles per tenant, and then register those within our Argo cluster as individual target clusters per tenant? At least then you have some separation, but that doesn't really change the Argo access model; Argo itself still has access to all those tenant spaces.
So we've just been trying to toy with some of these ideas and see if there's any more individualized, secure way we could make this, so that the tenant configuration is tied more to Git and there's a full lifecycle of CI/CD per tenant. And I think that's it, thanks everybody. Two other talks we've been mentioning: we had a talk from a colleague this morning at CiliumCon about our runtime environment and how we're using Cilium, and then we have a talk later on Thursday about OPA and how we're using policy to control some of our CI process and who can deploy into Argo and when. Thanks.

Thank you. We have time for maybe one question. Questions? Oh, one right here. What's today's Wordle? Yeah. Hello, sorry. My question is about this: I see you are using one cluster with multiple tenants. How do you define the tenants? Is a tenant a specific namespace, or do multiple namespaces belong to one tenant?

So yeah, if you ignore Argo for a second, a tenant is a namespace, but we do have a concept where we're allowing tenants to create multiple namespaces, with some controls on top of that, because we don't give the tenants the ability to create namespaces themselves. We wrote an operator that gives them limited permissions to make new namespaces if they want to. And if you look at it from the Argo perspective, how we're managing that is we consider a team or a tenant a one-to-one mapping with their AWS or cloud account. So out of the box, they've got a namespace that maps back to their cloud account. They can deploy multiple apps into that, and they have the ability to create sub-namespaces if they want to. It's one Argo project per tenant, and that maps to namespaces prefixed with their tenant name. Okay, thank you.
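A hypothetical sketch of that per-tenant Argo project shape (the tenant name, repo pattern, and server URL are placeholders): destinations limited to the tenant's prefixed namespaces, sources limited to their repos, and no cluster-scoped resources.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: tenant-a
  namespace: argocd
spec:
  description: Applications for tenant-a
  sourceRepos:
    - https://github.com/example-org/tenant-a-*  # tenant's GitOps repos only
  destinations:
    - server: https://EXAMPLE.gr7.us-east-1.eks.amazonaws.com
      namespace: tenant-a-*      # namespaces prefixed with the tenant name
  clusterResourceWhitelist: []   # tenants may not manage cluster-scoped resources
```

Because AppProject destinations accept wildcard namespace patterns, the sub-namespaces a tenant creates through the operator stay inside the same project boundary.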
Thank you very much. Give a clap.