Our next talk comes from Philip Schwartz. Philip is a principal software engineer and technical lead of the continuous delivery platform team at T-Mobile. In this role, he leverages 16-plus years of continuous integration experience to revolutionize the way that T-Mobile builds, manages, and operates its future usage of CI/CD for all development within T-Mobile. Philip has built a high-volume CI system that supports multiple thousands of developers. In Philip's talk today, entitled Extreme Iterative Testing: Four Million Jobs Per Month and Growing, Philip will share some of the tips, best practices, and lessons he's learned running CI/CD at massive scale.

Extreme Iterative Testing: four-plus million jobs per month and growing. It's an extremely interesting topic. My name is Philip Schwartz. I'm a principal software engineer on the continuous delivery platform team at T-Mobile. I have a little over 16 years of CI/CD experience, from small setups to very large-scale production environments, with up to hundreds of thousands of servers managed and deployed from these systems. They've ranged from on-prem to cloud to hybrid, across a multitude of environments: web hosting, big data, and telecommunications.

Today, I wanna talk to you about CI/CD. What is CI/CD? CI/CD is the process of continuously testing a software application with integration to other systems, and then continuously delivering that tested software to production. And I call out two things in this specifically: integration, and that it is to production. Most people look at CI/CD, look at the overall ideas of it, and see it as, oh, I'm building my software. That is not the case. CI is completely about how your software integrates with other applications in your environment, and how you manage that process of iterative testing until you can deliver that software to production.

CI/CD is the new norm. And I call attention to one of my favorite books here, The Phoenix Project by Gene Kim. It's a wonderful story about how investing in environments, continuous integration, continuous delivery, and testing is key for a company to go from a large-scale disaster in a software application to an environment where they're deploying continuously and delivering new products to their customers. It shows a goal that all modern software projects try to achieve: being able to deliver in a very timely manner, as fast as possible.

Inside the CI/CD world, there are a few fallacies that always get lumped together. CI is not actually the building of the software. Yes, a lot of people build software in their CI pipelines, but that is not actually the point. The point is to take your built software and test it across all of your environments: how does it work with other software, and can you do that in a repeatable fashion? The other thing a lot of people don't focus on or realize is that continuous delivery is not expressly about your deployment, but about delivering that tested software. There are plenty of cases where you'd want to deliver software that has been tested through all of your integration and system tests, validated, and ready for production, but not actually deploy it directly. So it's all about the delivery of that software, not the deployment of it. But as I said, we all include these things in our CI/CD environment. It's okay, you can admit it.
Everybody does it because it's easier. A common pipeline that includes everything is always best when it comes to your developer experience. That's why it's common practice to CI all the things. But there's another issue here: your CI/CD environment becomes your bottleneck. Jobs back up as capacity is reached and more and more developers attempt to use it. That causes developers to wait, and this is one of the big things we're striving as a group to improve. My team tries to do it within T-Mobile; the GitLab CI team tries to do it as a whole and drive the community to perform better.

Traditional CI/CD platforms have a very rough time when it comes to scaling. Jenkins servers can only scale so large. You have to manually manage all of the static agents you're working with. There are modern models that have been bolted on, but they're after the fact, and you have to worry about plugins and the like acting up inside these servers and causing issues. The core issue, though, is manual management. You're managing these pieces sometimes with automation, sometimes by hand through the UI, and at the same time you have to do things like tune the JVM the application runs in, understand resource usage, and deal with jobs becoming noisy neighbors that cause problems for other jobs. That adds up to a very large overhead for the operations team running it, or, in most cases where teams run these for themselves, for the individual developers maintaining it.

GitLab CI is a completely different beast, and rightly so. The GitLab Runner is a Go-based project that can be deployed in a multitude of patterns: to Kubernetes clusters running jobs via pods, as Docker containers on individual hosts managed manually or dynamically, or on EC2 instances spawned dynamically through something like Docker Machine. The idea of the project is to scale rapidly and with minimal effort, though I'll emphasize that "minimal effort" only holds until you reach a very large capacity. The larger your CI scale grows, the more impact you'll feel from a day-to-day operations standpoint. One of the even more beneficial patterns is that you can define strong configuration options to offer differentiated runners to teams using these pipelines. Whether you wanna provide a runner that exposes one gigabyte of RAM and one CPU to a CI job, or 128 gigabytes of RAM and 32 CPU cores, you can define these in a manner that is repeatable and reusable throughout your entire use of GitLab CI. (A rough sketch of what that configuration can look like follows below.)

T-Mobile has been in headfirst on CI/CD for many years, and I'd like to dive into what we've done at T-Mobile and what our evolution has been. It's an evolution that has spanned almost eight years and a lot of iterations. Our ancient history, as I call it, started out the exact same way you see at a lot of companies. A single team or a shared organization manages something like a Bitbucket server, a Git server holding source code for one or more teams. Individual teams deploy and manage Jenkins or Bamboo servers, along with all of the agents connected to them to run CI jobs.
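To make that runner-sizing idea concrete before we go on: here's a minimal sketch of how two differently sized Docker runners might be declared in a GitLab Runner config.toml. The names, the token placeholder, and the exact sizes are illustrative, not T-Mobile's actual values.

```toml
concurrent = 100  # total jobs this runner manager will execute at once

# A small runner: each job gets 1 CPU and 1 GB of RAM
[[runners]]
  name = "docker-small"
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "docker"
  [runners.docker]
    image = "alpine:latest"
    cpus = "1"
    memory = "1g"

# A large runner: each job gets 32 CPU cores and 128 GB of RAM
[[runners]]
  name = "docker-large"
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "docker"
  [runners.docker]
    image = "alpine:latest"
    cpus = "32"
    memory = "128g"
```

Jobs then select the size they need through runner tags, so the same pattern stays repeatable and reusable across every team.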
Back in that ancient history, all of the CI jobs are created using the point-and-click UI: stringing together plugins, stringing together actions, pasting in blurbs of bash script to control how the CI job runs. At this point, everything is being done manually. You push code into a repository, the code lands in your mainline branch, and then teams kick off manual jobs to run a build and unit tests. If there's any integration testing done at all, it's an attempt to connect those individual pieces while stuck with long-lived environments, never really bringing it together. And it rarely includes the delivery of the built software. It's all completely self-contained, but completely separate between all of the teams within a company.

And it was that way at T-Mobile. Individual teams managed these setups, and by the time we started to migrate off of them, we had close to 30 Jenkins servers managed by different teams, close to 15 Bamboo servers, and four to five different Bitbucket servers. Source code was everywhere, CI/CD was everywhere, there were no insights into what was going on, and every team had to have its own experts to manage its pieces.

The middle ages within T-Mobile were the start of the epiphany of where we would go. We passed out of the dark ages into what we see as almost our renaissance. We built a shared services team and introduced CI/CD as a shared service: migrating to things like a centralized Bitbucket server, and adding a centralized Artifactory server to store the built artifacts that would be delivered to production environments. We centralized the Jenkins servers with shared agent pools to run all of these jobs, and added other things like SonarQube for code scanning. We even moved to building all CI jobs as code, using common shared libraries with developers writing Groovy. As you'd expect, that meant developers had to learn Groovy if they didn't know it. Being a Java shop, it wasn't the biggest stretch, and a lot of developers were able to take it on, but it's still something new that individual teams had to learn in order to work with it.

And this worked. We scaled to almost 20,000 projects running in this environment, with close to 5,000 developers actively using it daily. But it started to show its age quickly. The tools were all self-managed by the shared services team, we had to scale them rapidly, and if anybody has ever worked with a large Jenkins server, you know how that can go. When I say that our Jenkins servers, of which we were running three at a time, were using almost 192 gigabytes of RAM, you can feel the pain very quickly.

That brings us to our modern history, where we are today and what we're doing. This is what we call CDP, the Continuous Delivery Platform at T-Mobile. It is built on GitLab.com, and we went 100% in on GitLab CI. It provides shared templates and shared functionality, and enforces the usage of things like GitLab SAST and DAST. It still provides SonarQube for scanning and adds other things like AquaSec for container scanning. But the big benefit was that moving the management of our Git SCM out of the life cycle of our shared services team allowed us to focus on our GitLab runners and on what we could do to really scale CI/CD at T-Mobile. We started with Docker Machine for our GitLab runners.
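For context, a docker+machine runner that spawns ephemeral EC2 instances is configured roughly like this. This is a minimal sketch; the region, instance type, VPC and subnet IDs, and the token are placeholders.

```toml
[[runners]]
  name = "ephemeral-ec2"
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "docker+machine"
  [runners.docker]
    image = "alpine:latest"
  [runners.machine]
    IdleCount = 0                 # fully ephemeral: no warm instances kept around
    MachineDriver = "amazonec2"
    MachineName = "ci-runner-%s"  # %s is replaced with a unique machine ID
    MachineOptions = [
      "amazonec2-region=us-west-2",
      "amazonec2-instance-type=m5.xlarge",
      "amazonec2-vpc-id=vpc-xxxx",
      "amazonec2-subnet-id=subnet-xxxx",
    ]
```

An IdleCount of zero is what makes the pattern purely ephemeral: every job triggers a fresh instance-creation call against the AWS API.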
Those runners deployed to EC2 instances inside of AWS and were used in an ephemeral pattern to rapidly let teams run CI jobs, while allowing our CI infrastructure to scale out horizontally.

The future of where we're going looks really fun and really interesting. It's the drive to scale much faster and work in a pattern that can function quicker, going all-in for most components on GitLab. We're staying with GitLab.com and the SaaS platform, but moving to GitLab runners inside of Kubernetes, running via pods instead of having to spawn EC2 instances for every single job. It also means depending heavily on GitLab SAST and all of the new GitLab compliance pieces to see the quality of the code and look for security issues, scanning via things like the security templates, and using Gitleaks to look for possibly leaked secrets throughout our code bases, while still keeping things like AquaSec for container scanning. That's our goal, where we're driving and iterating towards rapidly.

Currently, inside of our CDP platform, we have just about 8,000 registered users. Of those 8,000, we see an average of about 6,200 active in a 30-day period, which means there are people who aren't actively using it, but we do have a very large user base that is functional. These users have access to and work with over 23,000 active projects within GitLab, broken up into a multitude of subgroups and organized by the teams and projects happening within T-Mobile. On top of that, another 13,000 projects have been archived over time as inactive. Just so you can see the type of scale we currently have in our environment: almost 36,000 source code repositories that we're managing history and compliance on.

Over those same 30-day periods, we see an average of 20,000 merge requests per day from our developers. Every single merge request is guaranteed to run a pre-merge pipeline as well as a post-merge pipeline, which lets us really drive how things work inside our environment and what developers see and work with. At this scale, and this is a bit of our metrics of where we are currently, in the last 30 days we have launched 500,000 CI/CD pipelines, chaining anywhere from six or seven jobs up to pipelines with 700 jobs. As part of this, we've run 4.5 million CI/CD jobs. That's 4.5 million jobs run in individual Docker containers throughout all of our CI processes at T-Mobile, whether that's GitLab security scans, Sonar scans, builds of software projects, pushes to a package registry, or building a container and pushing it to a Docker registry. All of these jobs run through our current GitLab runners. We've been able to maintain an average pipeline duration of eight minutes, meaning most teams see their pipelines complete in under 10 minutes, which lets teams actually iterate on code much faster. With this, we've also driven to make delivery and deployments to environments much faster.
Our NPA, our non-production environments, are now seeing an average of 80,000 deployments per month. Compared with our previous shared platform, that is a 45-fold increase in how fast we were deploying to these non-production environments, and even further beyond what came before it. We're also currently seeing upwards of 40,000 production deployments per month, whether that's deploying the pieces that run the T-Mobile websites, the point-of-sale services, or any of the T-Mobile applications. They run on this platform end-to-end: from ideation of new changes, new features, or bug fixes, through development, testing, and validation, to deployment into an NPA environment with automated integration tests, and then deployment into a production environment with validation tests, allowing our developers to really, really iterate.

From here, what I'd like to talk to you about is where we hit struggles, and the kinds of struggles you'll see as you grow to this type of CI scale. As I said, we started with Docker Machine, and that's where we hit struggles initially. In our first take, within the first couple of months, we reached the point of running 400,000 CI/CD jobs in a 30-day period. These were purely ephemeral GitLab runners. A request would come in, the GitLab runner manager process would see that it had a job to run, and it would use Docker Machine to spawn an EC2 instance. That EC2 instance would come up, the job would run on it and return results, and the instance would be recycled. We averaged about 800 EC2 instances launched per hour in this environment, using one AWS account and one region. We learned very quickly that this takes a hit from AWS's API rate limiting, where each call draws from a bucket of capacity that drains and then slowly refills. What happened is we started to see an average of 300 instance-creation calls per hour fail and get forced into retry loops until there was rate limit available to launch instances. We were able to get up to about 1,200-job concurrency in small bursts, but as we hit that concurrency level, with that many jobs running at the same time, we saw even higher levels of retries. And the more retries you have, the slower pipelines run, the harder the system is for teams to use, and the more painful the developer experience is.

Which brings us to take two. We realized that as we scaled and brought more and more users onto this platform, migrating them from our previous one, we needed better patterns, and we moved to what we called a mostly ephemeral pattern. We would still launch instances with Docker Machine, but we set the settings inside the TOML config for our multi-runner to drive about 20% reuse: instead of always building new instances, we tried to reuse part of them to reduce that rate limit issue from AWS. (A sketch of the relevant TOML knobs follows below.) With this, we averaged about 1,900 EC2 instances launched per hour, and we also doubled down on what we were doing: we now had two AWS accounts, with two regions in use in each, to spread out our rate limit usage. But we realized we still had a lot of moderate retries popping up.
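The knobs that drive that reuse live in the [runners.machine] section of the same TOML. A minimal sketch, with illustrative numbers rather than our production values:

```toml
[runners.machine]
  IdleCount = 50    # keep a warm pool of instances waiting for jobs
  IdleTime = 1800   # seconds an instance may sit idle before removal
  MaxBuilds = 5     # jobs an instance runs before it is recycled;
                    # raising this raises the effective reuse rate
```

Roughly speaking, with MaxBuilds = N, about N - 1 of every N jobs land on an instance that already exists, so raising N cuts instance-creation calls against the AWS API at the cost of longer-lived machines.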
We still had the same average of about 300 create-instance retries per hour because, as we realized, more capacity meant developers jumped to use it, so our concurrency went up, and with the concurrency up, the rate limits were still there. Which brings us to where we are today. We still have struggles, but for a whole other reason. We're currently set up to reuse about 90% of all the runners we start. We're averaging about 650 EC2 instances per hour across the same two AWS accounts and two regions per account, and we're not getting any EC2 retries anymore, because we're not spawning instances nearly as fast; we're using each instance much more than before. We've been able to burst upwards of 3,200 concurrent jobs. That's 3,200 Docker containers spread across these EC2 instances, running at the same time, and we never hit any AWS rate limit retries.

But we reached a new problem: a telecommunications company like T-Mobile is very starved for IP addresses. Inside these two AWS accounts, we were only able to get two /21 subnets, about 4,096 IP addresses, with a large chunk dedicated to VPC infrastructure such as routers and other necessities, as well as to our multi-runners and other infrastructure. So we sit with about 3,800 usable IP addresses at a time. And we realized that as we burst to bigger and bigger numbers, we reach IP starvation, because we're able to consume all of those addresses. So what can we do to get around this? What is possible?

This is what brings us to our future, and why we're looking at Kubernetes and driving towards it. Our managed Kubernetes clusters are not restricted on IP addresses; they have their own private service network. That lets us go back to a purely ephemeral model. Every job has a GitLab runner dedicated to it, there's never any reuse, and every one of those runners lives in its own Kubernetes pod. We're gonna be able to do this across a much smaller footprint of both AWS EC2 instances and, as we add more cloud functionality, Azure instances, because we can use very large nodes and pack them with as many pods as possible. From our testing of each Kubernetes cluster we're standing up, we're targeting bursts of greater than 5,000 concurrent jobs. No need for retries, because we never hit the AWS API. No issues with IP starvation, because we're on a completely private service network that can scale to whatever size we need. (I'll show a rough sketch of the executor configuration below.)

All of these jobs run with some very interesting customizations we use at T-Mobile. We have a custom service that runs as a sidecar next to every single job. It's responsible for scraping our environment variables, getting build timings, and sending all of those metrics into our data platform. We also use it to expose things like shared scripts usable inside jobs, as well as our internal custom CA certificates and other pieces individual jobs need to reach systems within T-Mobile. As part of this, we have what we call the CDP metrics, emitting telemetry, alerting, and logging system, or METAL, which is a complete metrics platform and data platform.
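To give a feel for that Kubernetes direction, here's a minimal sketch of a Kubernetes executor block in the runner TOML. The namespace and resource numbers are illustrative, not our actual settings.

```toml
[[runners]]
  name = "k8s-ephemeral"
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "ci-jobs"     # job pods land on the cluster's pod network,
    image = "alpine:latest"   # so no EC2 API calls and no VPC IPs per job
    cpu_request = "1"
    cpu_limit = "2"
    memory_request = "1Gi"
    memory_limit = "4Gi"
```

Since the runner manager only creates pods, concurrency is bounded by node capacity rather than by cloud API throughput or subnet size.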
We're using METAL for reception of all the webhooks GitLab sends into our environment. We process them, index them, and use them to build graphs. We're receiving all group and project metrics, as well as all of our build logs, runner logs, and system logs for all of our tools and our runner cluster. From these, we're able to build really nice visuals and dashboards using Grafana, with views at the executive level, at management levels, and down to the developer level, so teams and individual developers get really good insight into what's going on inside their CI/CD life cycle.

We also have a custom set of tools to manage what we call GitLab ACLs, as well as settings. This includes GitOps management of group and project membership using YAML files and the standard merge request flow. A user opens a merge request requesting access to a specific subgroup or specific projects, and a group of approvers from that group or project can manage and approve it directly in the Git workflow they're used to. (I'll show a hypothetical example of what those files can look like at the end of the talk.) We also have a blanket set of defaults that we apply to every single group inside the T-Mobile namespace on gitlab.com, and similar settings for projects, and we use the same GitOps pattern to apply these across all of those projects as a safety net for how we manage and how we do everything as a whole.

As for future work, here are a couple of the pieces, just to give you an idea of where our mindset is. We're really looking right now at functions as a service through OpenFaaS. We're looking to build new tooling in a fast, feature-forward architecture that lets us easily modify and change the metrics we build and how we work with incoming webhooks. For example, for the group membership webhook that GitLab added a couple of releases ago, we're working on a pattern where we receive those webhooks and then perform actions across a number of areas of our T-Mobile namespace based on the webhook we received. And we're looking to do it in a stateless pattern that can scale to match the same volume we do in jobs.

We also have a pattern we're looking to use with Open Policy Agent to do pipeline validation. This uses Open Policy Agent and its Rego language to validate the CI YAML: looking for specific includes and specific jobs, and verifying that the things we want compliance-wise are included in the CI YAML before developers are allowed to merge or run it. (There's a hypothetical sketch of such a policy at the end of the talk as well.) That way we can build in as much security and forward-thinking compliance as possible, ahead of time, even before changes to a project's CI YAML are made.

The goal behind all of these things is 100% developer experience. How can we really focus on driving these interactions and making it easier for our developers to use? What common tooling and standards are we able to put out there? And what are we able to do to grow? Currently, we're at 4.5 million jobs per month. Where we see this platform going as we increase our user base is anywhere from six to eight million jobs per month.
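Before wrapping up, here is the promised hypothetical shape of one of those GitOps membership files. The schema is invented for illustration; it is not T-Mobile's actual format, and the usernames and group paths are placeholders.

```yaml
# Hypothetical GitOps-managed access file, applied via merge request
group: tmobile/payments-platform
members:
  - username: jdoe
    access_level: developer
  - username: asmith
    access_level: maintainer
projects:
  point-of-sale-api:
    members:
      - username: rlee
        access_level: reporter
```

Because changes arrive as merge requests, the approvers of the group or project review access changes in the same workflow they use for code.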
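And to make the Open Policy Agent idea concrete, a minimal Rego sketch of the kind of rule described above, assuming the policy input is the parsed CI YAML; the template project and file names are illustrative.

```rego
package cdp.pipeline

# Hypothetical policy: reject CI YAML that does not include the required
# compliance template.
deny[msg] {
    not includes_required_template
    msg := "CI YAML must include the shared compliance template"
}

includes_required_template {
    some i
    input.include[i].project == "tmobile/cdp-templates"
    input.include[i].file == "/templates/compliance.gitlab-ci.yml"
}
```

A rule like this can run in a pre-merge check, failing fast before non-compliant CI YAML is ever merged or executed.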
Wherever that growth takes us, we wanna be able to make sure that the experience our developers have while using the platform matches what we feel is the ideal vision for our product. I'd like to thank you for being here and listening to this talk, and I'd love to answer any questions in chat as they come up. Have a really good day, and please enjoy the rest of GitLab Commit.