So again, my name is John Scarbeck, and I'm going to have a conversation with y'all about how GitLab.com added additional clusters to its infrastructure to reduce overall cloud costs. I'm a site reliability engineer at GitLab, specifically on the delivery team. I've been with GitLab for about three years, and as you can see from this image, I'm also a beekeeper sometimes.

In this conversation, I'm going to talk a little bit about how GitLab.com got started with Kubernetes, about one of the problems we encountered with cloud costs and how we were set up initially, and then I'll dive into the solution we chose for that problem and some of the wins we got out of it.

So firstly, what is GitLab.com? We are the GitLab product itself at SaaS scale, serving roughly 2 million plus customers at this moment. It took us a while to get started in Kubernetes land. We've got a large infrastructure, and we've had to scale out quite vastly over the course of time. In doing so, our infrastructure team has had to break parts of our infrastructure into smaller chunks to make it a little easier to manage at the scale we operate. At the same time, we've also had customer demand for a Helm chart that would enable customers to install the entire GitLab application on-premises inside their existing Kubernetes clusters. With those two building blocks in place, we've been able to get our feet wet with Kubernetes. Today we have a hybrid architecture: virtual machines that are still managed via Terraform and Chef, and about 90% of our front-end workloads inside Kubernetes. We've done a few more services since then, but I decided to stress myself out and do a talk, so here I am.

When it came to getting started with Kubernetes, our infrastructure team needed a way to simply create a cluster very quickly without having to work terribly hard. We did not want to go the route of using our existing infrastructure management practices to build clusters, manage the upgrades, figure out how to troubleshoot them, and so on. We wanted to move quickly and get customer workloads inside Kubernetes as soon as possible. We use Google Cloud Platform as our infrastructure provider, so it was very natural for us to go down the route of checking out Google Kubernetes Engine, or GKE. That allowed us to focus our efforts where we deemed things more important to us: making sure we know how to manage and deploy our configurations for those clusters, integrating those clusters with our existing infrastructure, making sure all the applications interoperate well between our virtual machines and the clusters we deployed, and our observability stack. We want to make sure that as we migrate things between virtual machines and Kubernetes, we can still view the metrics of those applications throughout the migration.

With all of this in place, we were able to migrate a first component, which in this case was GitLab's container registry. This is a relatively lightweight Go application, and there aren't a lot of dependencies between it and other services within our stack.
It does not move a lot of data, but it does serve a lot of traffic for customers; I think our own CI is probably the heaviest hitter on our container registry at this point. That made it a perfect application: it's stateless, so it let us get our feet wet in Kubernetes. We were able to figure out the best way to migrate applications within our particular stack and the way we operate. We could test and validate the changes we make inside our infrastructure and make sure they're interoperable between our various infrastructure technologies, meaning Kubernetes and our virtual machines. We could also look a little bit into the future and determine what we need to do for future migrations, and set expectations for how we expect services to work when they get migrated into Kubernetes. We were still building our knowledge of Kubernetes itself and of how we expect future applications to behave within it, and then sharing that knowledge across the rest of our infrastructure team over time as we expand more services into Kubernetes.

So one of the things we thought about ahead of time, when it came to touching the next front-end services, was the investigation and the expectations we set for ourselves with that migration. And one of the problems we encountered became the heart of this talk: something we learned about cloud costs when you set up a regional cluster.

When we got started with Kubernetes, we went the route of using GKE and followed their recommendations for setting up production-worthy clusters. So we went with a regional cluster, deployed across all the zones inside that region. In this diagram, I'm showcasing the network traffic happening inside our front-end stack. When you operate inside a single cluster and you've got a web service deployed, you're technically deploying a few objects: the web service itself, along with all the web service pods and the services associated with them, and, in our case, the nginx ingress controller, or you might have a configuration for a different ingress controller. As we're using Google Cloud as our provider, we get a load balancer of some type; in this case, we utilize an internal load balancer. So client requests come in through our front door, which we use HAProxy for, and those requests route into the load balancer, which then sends them to the nginx ingress controllers.

At this point, there are two potential paths the network traffic can take: one a very nice low-latency, low-cost path, the other a kind of heavyweight path. Because a regional cluster is set up so that knowledge of the web service pods spans the entire cluster, the nginx ingress controllers know about all of those pods. Because of this, whenever a request comes into an ingress controller located, say, in zone B, and that request lands on a web service pod located in zone C, two potentially bad things can happen. One, you have additional network latency because you've egressed traffic from one zone to another; it may be minor, but it's something to keep in mind. The other is the cost of the network egress from one zone to another. And this is where we at GitLab.com had a problem to solve. So let's dive a little bit into the costs associated with this.
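To make that routing picture a bit more concrete before we get to the dollars, here's a rough back-of-envelope sketch, my own illustration rather than anything from our tooling, assuming pods are spread evenly across zones and the ingress controllers route uniformly across all of them:

```python
def cross_zone_fraction(num_zones: int) -> float:
    # A request arriving at an ingress controller in one zone has a
    # (num_zones - 1) / num_zones chance of being proxied to a web service
    # pod in a different zone, under the uniform-routing assumption above.
    return (num_zones - 1) / num_zones

for zones in (2, 3, 4):
    print(f"{zones} zones -> ~{cross_zone_fraction(zones):.0%} of ingress-to-pod traffic crosses zones")
```

With the three zones of a typical region, that works out to roughly two-thirds of the proxied traffic leaving the zone it arrived in, which is the figure the napkin math below uses.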
All cloud providers have this problem where network egress is a little bit costly. Insert your favorite cloud provider's pricing calculator here, but Google charges one penny per gigabyte for traffic that egresses between zones. So before migrating the next big service, we were just doing some napkin math: 500 terabytes of data moving across zones is going to cost you over $5,000 a month. For GitLab.com and the amount of data that we move, we're moving petabytes of data across an entire month. And assuming a regional cluster with three zones, roughly two-thirds of that traffic is potentially going to cross those zonal boundaries, and that's going to get quite expensive.

This goes a little bit into how difficult it is to predict your cloud bill and estimate your potential cloud costs. You know how much a Kubernetes cluster is going to cost, because they tell you it costs X amount per hour, and for the nodes backing those clusters, you can roughly predict how much the instances are going to cost you over the course of a month. But GitLab.com is a very network- and data-driven workload. We're subject to what our customers send to us and the data they request from us. So the cost associated with network egress is a lot harder to predict, and it can be a surprising line item if you're unaware of it.

Now keep in mind that crossing zonal boundaries, or egressing between zones, is not avoidable everywhere. In this diagram I'm showing that the web service pods eventually have to talk to a database or a file server to retrieve the data that was requested. That's a cost we just have to absorb; it's something we are willing to take because of the way we've engineered the lower layers of our infrastructure to be as redundant as possible. However, the nginx ingress controllers are not really doing a lot of work here. They're simply proxying the data, and they might be doing a little buffering, but the important part is that they're taking the traffic from the load balancer and proxying it across all of these web service pods. Realistically, they don't need to know about all of those web service pods; they just need to know how to route traffic to the ones that are closest to them. So if we could solve that problem, we'd have some sort of improvement, specifically to our cloud bill.

So, as my talk title suggests, we decided to add a few more clusters. Given a region that has three zones, we simply created a cluster per zone. Now when traffic comes in through our HAProxy front end, which is spread across all of our zones as well, if your client request went through HAProxy in zone B, you're going to be favored toward the cluster that is also located in zone B. The nginx ingress controller inside that cluster only knows about the web service pods located in that cluster, so that traffic stays inside that zone for the time being. With this, we've completely eliminated the possibility that an nginx ingress controller routes your traffic to another zone and incurs additional cost, whether that's network latency or the network egress charge on the cloud bill. But again, underneath that, when the web service pods reach out to the file servers, they'll still potentially cross a zonal boundary, and as I mentioned before, that's a cost we have to absorb.
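Putting rough numbers on that before-and-after, here's a small sketch of the napkin math. The penny-per-gigabyte rate is hard-coded purely for illustration; check your own provider's current pricing:

```python
EGRESS_PER_GIB_USD = 0.01  # illustrative inter-zone egress rate, roughly a penny per GiB

def monthly_egress_cost(total_gib: float, cross_zone_fraction: float) -> float:
    # Only the share of traffic that actually crosses a zonal boundary is billed.
    return total_gib * cross_zone_fraction * EGRESS_PER_GIB_USD

GIB_PER_TIB = 1024
print(monthly_egress_cost(500 * GIB_PER_TIB, 1.0))    # 500 TiB all crossing zones: ~$5,120/month
print(monthly_egress_cost(500 * GIB_PER_TIB, 2 / 3))  # regional 3-zone cluster, ~2/3 crossing zones
print(monthly_egress_cost(500 * GIB_PER_TIB, 0.0))    # cluster per zone: the ingress-to-pod hop stays local
```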
And there are a few benefits to having multiple clusters as well. Take cluster upgrades: you can spread them out over the course of time. Instead of having a single cluster that gets upgraded, you have multiple clusters, and should an upgrade fail, you've got the ability to remove a cluster from rotation to fix it. HAProxy, in our case, has knowledge of all of our clusters, so when we take one down for maintenance or need to perform some sort of maintenance procedure, we have the ability to shift that traffic over to other clusters as necessary.

So what other options are available? A multi-cluster configuration may not be for everyone. One is a service mesh. Today, GitLab.com does not use a service mesh, and the GitLab product does not use a service mesh, so this would have been a lot of work for us to figure out how to implement. It would have been a combination of figuring out which service mesh would work for us, learning how to use it and configure it, building up the observability tooling, and all that stuff. In this case, it was just a path we didn't want to go down. Another option would have been modifying our regional cluster configuration; there are many ways to configure a managed cluster inside Google. But we saw the benefits of having more clusters as greater than those of a single cluster, so we decided to go down that route. And today there are more features available than there were when we first started looking into this. Kubernetes now has a feature called topology-aware hints, something that was not available back then, but it might be worth checking out.

So could you consider this configuration more complex? You could go both ways on this. It's more complex in the sense that you do have more Kubernetes clusters. But if you think about a well-balanced workload and the configurations associated with it, you're not necessarily adding more complexity for your infrastructure teams. The number of nodes running those same workloads is going to be roughly equivalent, and the cluster configurations and the workloads defined on them are pretty much identical. You're leaning away from treating clusters as pets and toward the clusters-as-cattle methodology here. I think the most strategic decision we had to make at GitLab was how we identify our clusters, whether that's a name or some sort of location tag, so that we can observe and monitor our clusters appropriately. And then, of course, it simplifies the cloud bill by lowering the costs.

Now some bonus things I want to touch on that we got out of this configuration. One is a fun way to mitigate incidents. This chart shows our NAT port usage over the course of time. When we first created our additional clusters, we were asking all of our clusters to deploy new code at the same time. This had a downside: after a period of time, as we grew into Kubernetes and started pushing more stuff into it, we ran into a situation where every time we deployed, all of the Kubernetes nodes wanted to pull down the new image we wanted to run, all at the same time. That was kind of bad and resulted in NAT port exhaustion for us in this particular case. To mitigate it, all we had to do was modify our CI pipelines to change how many clusters we deploy to at a time.
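Conceptually, the change looked something like this. This is a minimal sketch in Python rather than our actual GitLab CI configuration, and the cluster names and deploy step are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

CLUSTERS = ["gke-zone-a", "gke-zone-b", "gke-zone-c"]  # hypothetical cluster names

def deploy(cluster: str, image_tag: str) -> None:
    # Stand-in for whatever actually performs the rollout (e.g. a Helm upgrade job).
    print(f"deploying {image_tag} to {cluster}")

def rollout(image_tag: str, clusters_at_a_time: int = 1) -> None:
    # clusters_at_a_time=1 deploys to one cluster at a time; raise it again
    # once the image-pull / NAT port pressure has been resolved.
    with ThreadPoolExecutor(max_workers=clusters_at_a_time) as pool:
        list(pool.map(lambda c: deploy(c, image_tag), CLUSTERS))

rollout("v1.2.3", clusters_at_a_time=1)
```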
So instead of deploying in parallel, we deployed to one cluster at a time, which gave our infrastructure engineers time to fix that problem; once it was fixed, we could revert that change and deploy to as many clusters at once as we want.

Another one I want to highlight, and something we use very heavily right now, is the ability to test cluster configurations. This is a chart showcasing pod counts over the course of a few days. You can see, right in the middle of the chart, that we made some sort of change and one cluster is using far fewer pods than our two other clusters. We've effectively created a control set and a test set of clusters: the controls are the two clusters we did not touch at all, and the test cluster is the one where we deployed some sort of change to see whether we get positive or negative feedback. And this could be anything, whether it's the way you deploy the workload, the horizontal pod autoscaler configuration, or a change to the application itself and the way it behaves. Here we have the ability to compare, against the workloads we see in production, how well the cluster is behaving. In this particular case, we were trying to gain efficiency by lowering how many pods we were running for this particular workload.

The last thing I want to highlight is maintenance procedures. We have the ability to remove traffic from entire clusters as a whole. This is a showcase of web requests coming in through all three of these clusters in this one region. We had found an interesting bug in our Helm chart, and if we had simply deployed a fix for it, it would have been incident-inducing. So instead, we created a maintenance procedure where we pulled traffic from a cluster entirely, deployed the fix, validated the fix was in place, put traffic back into it, and repeated that process for all of our clusters.

So to wrap up, I think one thing to keep in mind is determining whether or not this is a problem for you. It was a problem for us because we started off using a regional cluster, and the fact that network egress between zones within a cluster was going to be costly was not immediately apparent to us. Keep in mind that the solution we chose may not be for everyone. We went the route of having multiple clusters because, at the time we did this, we didn't have a lot of stuff inside Kubernetes as a whole; it was easier for us to deploy more clusters than it was to look at other solutions. And also, consider whether this is a problem with other workloads outside of Kubernetes. Like I mentioned, we've got this problem between our web service layer and our file servers and such, so it could be a problem elsewhere within your stack.

I spoke really fast because this is my first time, so my apologies. I've got plenty of time for questions.