…as a developer experience engineer at Grammarly, teaching DevOps to people with different backgrounds and skill sets. Thank you for being here with us, and the stage is yours.

Hey, thank you for the introduction. Hi, everyone. I'm thrilled to be here today. I want to share our case study of using Kubernetes, Karpenter, and some other tools to build a scalable and cost-efficient infrastructure for CI runners. It is built around GitLab and AWS, and I believe you'll find some helpful tips or insights for yourself even if you prefer another CI or cloud product.

Let me introduce myself first. I'm Serhii Vasylenko. I've been in IT for about 12 years and currently work at Grammarly, and I love it. My focus area is CI/CD, and I'm a big fan of AWS and HashiCorp product offerings.

Here is what I'm going to cover in this talk. For starters, the context: why we decided to create a new CI/CD infrastructure and how the whole backend infrastructure and code management look in general. Then I will explain how one account contains our CI/CD infra and how CI jobs get access to resources in other AWS accounts. After that, we will meet the Karpenter node provisioner, and I will explain how it works and how we use it. Then you will see what challenges we have around Karpenter and how we deal with them using the Kyverno project. Finally, I will talk about the way we use GitOps approaches with Argo CD to deploy all that stuff and keep it up to date. I will do my best to make this presentation interesting for you, whether you are proficient in the Kubernetes world or just starting your journey. Let's go.

Our backend infrastructure is hosted on AWS. We have around 200 accounts governed with AWS Organizations, and the number keeps growing. Why do we need that many accounts, some might ask? Well, a single AWS account represents an environment, like QA, pre-prod, or prod, for some project or backend service. Each account has its own VPC, and transit gateways help us connect accounts with each other when needed.

We use a self-hosted GitLab to arrange the development and code management process. Project repositories are organized in groups by the teams that develop them. Around 400 users maintain around 1,500 projects to date. And because we want to maximize collaboration, we need to make sure that everyone can contribute to any project. This poses three challenges. Well, maybe more, but three main ones, I would say. If everyone can contribute to everything, they need to run CI for feature branches and sometimes run tests in the QA environments of projects they do not own. Sometimes they need to manage or update AWS resources in those accounts. And we need to keep all of it secure and auditable.

The previous version of the CI infrastructure could solve this, but in a way that still left a lot of room for improvement, so to speak. CI runners were deployed in every AWS account. There was no autoscaling for runner instances. We had self-service with best-guess provisioning, where users picked the desired instance size and there was no capacity review later. And when several GitLab projects needed access to the same AWS account, more and more runner instances got created in that account. So let me explain now how we solved all that, what the solution looks like, and what the cost looks like. And that leads us to the "one CI account to rule them all" section.
We decided to go with Elastic Kubernetes Service (EKS) in a separate account as a platform for CI/CD because we wanted centralized resource management, we needed to keep up with the networking model across the organization, we wanted more customization for CI, and, of course, we needed horizontal scaling.

While network access is fairly simple and based on transit gateway routing combined with security group rules, IAM access was a challenge we needed to solve: how would a CI job in the CI account access other AWS accounts? We leveraged Kubernetes service accounts to solve that. GitLab can specify a custom service account for the pod that contains a running CI job. We also need an OpenID Connect provider for the cluster to make a cross-account IAM role assumption from a pod running inside that cluster. The projects' AWS accounts have special IAM roles; they are used by CI jobs, and they trust their assumption only to the defined OpenID Connect provider.

Let me clarify that using the diagram. Here is a pod with the CI job. That pod has a service account with an annotation that references an IAM role in another AWS account. Through OpenID Connect, the pod effectively assumes the IAM role inside another AWS account. But that role has a special trust policy that not only limits the scope of principals to the OpenID Connect provider of our cluster but also allows assumption only to specific service accounts. And it is much simpler for users: all they need to do is specify the account name in the CI configuration, and their CI job will be able to access AWS resources in that account.

You might notice additional CI job variables in that example. This is how users can set the desired amount of resources per CI job or for the whole pipeline, and that actually leads us to the next part of my talk: resource provisioning and scheduling.

We started with the well-known Kubernetes Cluster Autoscaler, but we switched to Karpenter quite fast once we discovered its features and potential. Karpenter does not require node groups, and it is highly customizable in terms of EC2 node types and purchasing options. Karpenter observes the aggregate resource requests of unscheduled pods and makes decisions to launch and terminate nodes to minimize scheduling latency and infrastructure cost at the same time. It can also coexist with, and be aware of, nodes provisioned outside of Karpenter. And you can have several Karpenter provisioner configurations in your cluster.

Here is an example of a Karpenter provisioner configuration. We define the instance families, the node types Karpenter is allowed to choose from, but we set some limitations, skipping the smallest and the biggest types. We also allow both on-demand and spot purchasing options, so Karpenter can launch spot when a user needs that.

So when GitLab projects spawn hundreds of jobs, and therefore pods, it is okay for us. If we have nodes that can accept jobs, they will be used. If we don't, Karpenter will create as many as needed to meet the resource demand, and the pods will be scheduled there. And that makes Karpenter great for GitLab CI, and for us, because once a CI job ends, the pod dies. It means that when a node becomes empty, Karpenter will terminate that node right away. While just-in-time node provisioning is great for costs, it is not so good for user experience, because node provisioning takes some time.
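To make the cross-account access flow a bit more concrete, here is a minimal sketch, not the actual setup from the slides. The names, namespace, account ID, and role are placeholders; the service-account annotation and the trust-policy condition mentioned in the comment follow the standard EKS IRSA pattern, and the KUBERNETES_* variables are the ones GitLab Runner's Kubernetes executor understands (service-account override also requires it to be allowed in the runner configuration). In the talk, users only specify an account name and tooling maps it to the right service account; here the job sets the override directly.

```yaml
# ServiceAccount placed in the GitLab group's namespace. The annotation is the
# standard EKS IRSA link to an IAM role in the target account. That role's trust
# policy trusts only the cluster's OIDC provider and a subject claim like
# "system:serviceaccount:gitlab-payments:ci-access-payments-qa".
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-access-payments-qa        # hypothetical name
  namespace: gitlab-payments         # hypothetical GitLab group namespace
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/ci-payments-qa  # hypothetical role
```

```yaml
# .gitlab-ci.yml (hypothetical job): the user picks the service account and,
# optionally, the pod resources. The KUBERNETES_* variables are read by the
# GitLab Runner Kubernetes executor when it creates the job pod.
integration-tests:
  variables:
    KUBERNETES_SERVICE_ACCOUNT_OVERWRITE: ci-access-payments-qa
    KUBERNETES_CPU_REQUEST: "2"
    KUBERNETES_MEMORY_REQUEST: "4Gi"
  script:
    - aws sts get-caller-identity   # should report the assumed role in the target account
```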
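The provisioner configuration from the slide is not in the transcript, so here is a rough sketch of what such a configuration can look like, using Karpenter's older v1alpha5 Provisioner API. The instance families, excluded sizes, limits, and TTL are illustrative guesses, not the actual values; the requirement keys and the capacity-type label are the ones Karpenter defines.

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ci-default
spec:
  requirements:
    # Allow a set of general-purpose and compute/memory-optimized families.
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: ["m5", "m6i", "c5", "c6i", "r5", "r6i"]
    # Skip the smallest and the biggest sizes.
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values: ["nano", "micro", "small", "metal"]
    # Allow both purchasing options, so spot can be requested per job.
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand", "spot"]
  limits:
    resources:
      cpu: "2000"            # hard cap on total capacity this provisioner may create
  ttlSecondsAfterEmpty: 30   # the empty-node TTL mentioned later in the talk lives here
  providerRef:
    name: ci-default         # AWSNodeTemplate with subnets and security groups, not shown
```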
So we made some calculations and experiments and decided to keep a certain number of instances always running, so that the most active projects have a much better chance of their CI jobs starting right away, without losing time on node provisioning. We call those nodes warm pools. Just so you know, Karpenter supports a TTL for empty nodes as well, so it can keep nodes alive, even empty, for some time. But again, for our use case and conditions, the warm-pool tactic works better, at least for now. We keep it in balance with our budget expectations, and so far it is okay for us to look at the usage metrics and adjust the size of that so-called warm pool from time to time. We believe it is a fair price for a better user experience.

Okay, so a user can set an AWS account and resource requests from the GitLab CI config, but what about other things? Control of on-demand versus spot launches, or sometimes a GPU for ML-related automation, or maybe a particular instance type and family for whatever reason? Well, GitLab cannot tell Karpenter how to do that, but there is another thing that can. And here goes Kyverno, or Kiverno, I hope I pronounce it correctly. Kyverno is a policy engine. A policy is just another kind of Kubernetes resource, so it integrates seamlessly. A policy can validate certain conditions and mutate, generate, or delete Kubernetes resources on the fly. Kyverno runs as a dynamic admission controller in a Kubernetes cluster. It receives validating and mutating admission callbacks from the Kubernetes API server and applies matching policies, returning results that enforce admission policies or reject the requests.

Here is a policy example to elaborate on that. We check the annotations of the pods here. GitLab allows setting arbitrary annotations in the CI config. So if we see a capacity-type spot annotation, we add tolerations and a node selector to the pod configuration before it is scheduled. When it gets to scheduling, Karpenter processes these added fields accordingly, and the pod gets placed onto a spot instance. Again, it is much simpler for users: they only need to add that annotation explicitly in the job config if they want to run a CI job on a spot instance.

Another example is the even spread of pods across availability zones. This is applied by default to all pods, no user interaction here. We need to make sure that we use all availability zones, and therefore all subnets, to avoid an IP address shortage in any of them. This example concludes the Kyverno topic and leads us to the final part of the talk: our approach to cluster management and deployments.

We chose Argo CD because it aligns with our approach to infrastructure management in a declarative, GitOps way, and it offers a template-based approach to the setup with a high degree of flexibility. As you can see, we have a lot of stuff managed by Argo, so let me explain the two main approaches we use to make that happen.

In Argo, a logical group of Kubernetes resources defined by manifests is called an application. We have many applications in Argo that support the purpose and work of our EKS cluster. Their set and configuration are static, defined in the code base manually and updated only when we need it. They share some common values, though: the environment name, for example. Argo offers an approach to manage that kind of setup. It is called app of apps: there is a parent app, a Helm chart, with specs for its children. The child apps, also Helm charts, are organized into sub-projects.
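The policy from the slide is not in the transcript, so here is a hedged sketch of what such a mutation can look like in Kyverno. The annotation key, the namespace pattern, and the toleration are assumptions; the karpenter.sh/capacity-type label in the node selector is the one Karpenter actually sets on its nodes.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: ci-spot-placement
spec:
  rules:
    - name: place-spot-annotated-pods-on-spot
      match:
        any:
          - resources:
              kinds: ["Pod"]
              namespaces: ["gitlab-*"]   # hypothetical CI namespaces
      preconditions:
        all:
          # Only act on pods whose CI config added this (hypothetical) annotation.
          - key: "{{ request.object.metadata.annotations.\"capacity-type\" || '' }}"
            operator: Equals
            value: spot
      mutate:
        patchStrategicMerge:
          spec:
            nodeSelector:
              karpenter.sh/capacity-type: spot
            tolerations:
              - key: dedicated          # hypothetical taint used on spot nodes
                operator: Equal
                value: spot
                effect: NoSchedule
```

The availability-zone spread policy mentioned above would follow the same shape, except it would mutate every CI pod and add a topologySpreadConstraints section instead of a node selector.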
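For readers who have not seen the app-of-apps pattern, here is a minimal sketch of a parent application: an Argo CD Application that points at a Helm chart whose templates are themselves Argo CD Application manifests, the child apps. The repository URL, paths, and values file are placeholders, not the actual layout from the talk.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-addons              # hypothetical parent app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/platform/argo-apps.git   # hypothetical repo
    targetRevision: main
    path: charts/cluster-addons     # Helm chart rendering the child Application manifests
    helm:
      valueFiles:
        - values/prod.yaml          # shared values such as the environment name
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```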
So we don't add new apps there frequently, and it's a convenient way for us to manage this. However, things get more interesting if you have a dynamic set of applications. I know it's been a lot, but if you remember from the beginning of the presentation, there are service accounts that we use to give CI jobs access to other AWS accounts, right? We have a lot of GitLab groups and a lot of AWS accounts. And because we follow the principle of least privilege, we do not allow access to everything for every GitLab project by default. Instead, we have a self-service portal where users can request the access they need. Effectively, it means creating a service account with the needed IAM role annotation and putting it into the namespace of a particular GitLab group. Therefore, we have a dynamic set of resources here, because users initiate their creation, and the number of GitLab groups and AWS accounts is not static.

Argo has another approach to deal with that. It is called the application set. An ApplicationSet is a custom resource that generates configs from templates. It supports different methods for generating the content, for example, based on files and folder contents or on a list of values. The ApplicationSet controller is installed alongside Argo CD within the same namespace, and it automatically generates Argo CD applications based on the contents of a new ApplicationSet custom resource.

As an example, take a look at this. We have a set of GitLab groups described in JSON that we use for the generator. We also have some files that contain values for the supporting resources needed to create the Argo application. So when a new request comes from a user, it contains pre-generated values for the service account and information about the GitLab group. Combining that with the rest of the configs, Argo creates a new application that makes CI job access to the AWS account possible. Another cool thing here is that changes made to the template in the application set are automatically applied to every generated application. This helps when you suddenly decide to change your naming convention, for example.

Once again, a look from the user's perspective. A user interacts with a simple web form, and our automation tools do the rest. A special service called Platformer processes the form data and creates a merge request with the needed changes for Argo. Once the changes are merged to the main branch, Argo applies them in the cluster. Ultimately, users can access AWS accounts from the CI jobs in their projects.

Last but not least, the money. While I may not share the exact numbers, I still want to give you an overview of what we had versus what we have. Long story short, the new infra costs almost five times less than the previous one. We have not yet finished the cleanup of the old infrastructure, because some projects have their reasons to migrate slowly, but you can see that even if you sum the new infra cost with the leftovers, it is still drastically lower than it was.

That was the final slide. Thank you for your attention. I hope my presentation was helpful to you.
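For reference, a minimal sketch of the ApplicationSet pattern described earlier, using a Git files generator. The generator and file layout from the slide are not in the transcript, so every repository URL, path, parameter name, and template field here is a placeholder; only the ApplicationSet resource shape and the generator mechanism are standard Argo CD.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ci-aws-access
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://gitlab.example.com/platform/ci-access.git   # hypothetical repo
        revision: main
        files:
          # One JSON file per approved access request, committed by the self-service tooling.
          - path: "groups/**/access.json"
  template:
    metadata:
      name: "ci-access-{{gitlab_group}}-{{aws_account}}"   # keys come from the JSON file
    spec:
      project: ci-access
      source:
        repoURL: https://gitlab.example.com/platform/ci-access.git
        targetRevision: main
        path: charts/ci-access       # chart that renders the ServiceAccount and friends
        helm:
          parameters:
            - name: serviceAccount.name
              value: "{{service_account}}"
            - name: serviceAccount.roleArn
              value: "{{iam_role_arn}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "gitlab-{{gitlab_group}}"
```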