Hi everybody. I hope you're all having a great conference. Thank you for coming to my talk about Tamland, or how GitLab.com uses long-term monitoring data for capacity forecasting. My name is Andrew Newdigate. I'm a Distinguished Engineer in the Infrastructure Department at GitLab, where I focus on GitLab.com.

Let's start with a bit of a backstory. The story begins in mid-2021. GitLab.com experienced a series of S1 database incidents related to high resource utilization on our primary Postgres cluster. Within the company, concerns were raised about the growth rate of GitLab.com and the ability of our primary database to support this growth. If the database hit maximum capacity, it might lead to further outages due to saturation issues. Vertically scaling the main Postgres cluster was no longer an option. It was generally agreed that the way forward would be to split the main Postgres database into more than one cluster. Since a large proportion of our database traffic is related to CI, it was decided to move this feature to a second Postgres cluster through a process we called Functional Decomposition. This would be a complicated project spanning months. The fact that we wanted to keep the application running while migrating a huge volume of data to a new cluster made it even more complex.

A plan was devised and a date set for the completion of the migration, but valid concerns were raised about whether, given the recent spate of Postgres incidents, the primary database could continue to support the growth we expected to see until the Functional Decomposition project was delivered. One option would be to bring the project delivery date forward, but doing so would add additional risk to the project, forcing corners to be cut and possibly leading to further instability. It would be much better if we could confirm that the database would have sufficient capacity, allowing us to stick to the original date and giving the team the space to carry out the migration with due caution. We needed an accurate, data-driven assessment of the available capacity in our main Postgres cluster.

Using a new tool called Tamland, we were able to carry out a capacity review. With Tamland, we analysed all potential saturation points within the cluster and plotted their individual growth forecasts to predict which, if any, would hit saturation. We found that if a few small changes were made in the short term, we'd have enough runway to support the delivery of the Functional Decomposition over the multi-month time frame the project was expected to take.

So it turns out that this is a pretty boring story. As predicted, we had enough capacity to perform the migration safely, the migration was completed on time without hitting the capacity issues that we had been concerned about, and everyone lived happily ever after. But in software infrastructure, boring is normally a good thing. The point is that we were able to manage risk around an already risky project and control uncertainty. This eased some of the concerns around the project and gave our executives the confidence they needed in our project delivery dates.

I found this quote from a paper called "SRE Best Practices for Capacity Management" by Luis Quesada Torres and Doug Colish, and I think it captures the goal of capacity management really well: the goal of capacity management is controlling uncertainty.
In the midst of the unknown, the service must be available now and continue to run in the future. There is a challenging but rewarding and delicate balance of trade-offs in play: efficiency versus reliability, accuracy versus complexity, and effort versus benefit.

Before I continue, I'll define a few terms that I'm going to use throughout this talk. The first one is the definition of a resource. A resource can be anything that can be consumed and has a limit. A lot of capacity planning literature focuses on hardware resources, for example CPU, memory, and network bandwidth. These are, of course, very important examples, but there are a lot of other software-defined resources that are equally important to our capacity planning process. Some examples include open file descriptors, cloud quotas and rate limits, 32-bit ID address-space utilization in SQL tables, Postgres autovacuum utilization, NAT gateway utilization, global interpreter lock utilization in Ruby, database connection pools, single CPU thread utilization in Redis, and node pool size utilization. These are all examples of the types of resources we monitor, track, and forecast on GitLab.com. By combining all of these, we're able to build an accurate capacity forecast for our services.

There are four more terms that we need to define, the first being utilization. This should be fairly evident: it's a measure of how much of a given resource is being used or consumed at any moment in time. This always has a unit of measure, for example open files, bytes, database connections, and so on. Capacity is, for any given resource, the maximum utilization possible. For example, the maximum number of open files that a process is allowed, the total capacity of a disk in bytes, or the maximum number of database connections allowed in a pool. Utilization percentage is the utilization expressed as a percentage of the capacity of a resource. Finally, saturation is what happens when the utilization of a resource nears or reaches the capacity of that resource. In some cases, saturation, or at least prolonged saturation, will adversely impact the performance of the system, for example through increased queuing, higher latencies, errors, or possibly complete system failure. Planning for, and hopefully avoiding, saturation is the goal of capacity planning.

Here are some examples of saturation events and the typical times that a full mitigation might take. In some cases, the low impact of saturation or the ease of mitigation means that we don't need to focus on these items as a priority in our capacity planning process. For example, in Kubernetes, pod CPU saturation should be mitigated automatically within minutes by the horizontal pod autoscaler, so it's unnecessary to focus on these resources in a capacity planning review. However, there are many other resources that, once they've hit saturation, have mitigation times stretching from days to weeks or even months. The story I described earlier about the functional decomposition of our Postgres cluster is one such example. If one of these resources unexpectedly became saturated, it could have a detrimental impact on the availability of the system for weeks or even months. That could mean that requests to the service start experiencing latency issues, or the application could fail completely. For example, in the case of a Postgres transaction ID wraparound event, reaching saturation will automatically shut down the database and a disaster recovery operation will need to take place.
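To make that terminology concrete before we go further, here's a minimal Python sketch, not part of Tamland, that computes one of these resources for the current process on Linux: the capacity comes from the process's open file descriptor limit, and utilization is the number of descriptors currently open. The 90% threshold is just an illustrative value.

```python
import os
import resource

# Capacity: the maximum number of open file descriptors this process is allowed.
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

# Utilization: how many file descriptors are currently open (Linux-specific).
open_fds = len(os.listdir("/proc/self/fd"))

# Utilization percentage: utilization expressed as a percentage of capacity.
utilization_pct = 100.0 * open_fds / soft_limit

# Saturation: utilization nearing or reaching capacity; warn above a threshold.
if utilization_pct >= 90.0:
    print(f"approaching saturation: {open_fds}/{soft_limit} fds ({utilization_pct:.1f}%)")
else:
    print(f"{open_fds}/{soft_limit} file descriptors in use ({utilization_pct:.1f}%)")
```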
The goal of our capacity planning process is to focus on resources with high risk or long mitigation times.

Let's discuss the general approach that we use for capacity planning on GitLab.com before diving into further details. A lot of capacity planning advice that I've seen focuses on the system as a whole, using demand signals such as monthly active users or requests per second. They talk about the current utilization in terms of these signals and the total capacity expressed in the same terms. In other words, the current and maximum utilization of the entire system are represented by a single metric. You could call this black-box capacity planning, in that we treat the capacity as fairly opaque, with the capacity planning taking place outside the system and a single metric defining the utilization and capacity of the system in its entirety. The difficulty in this type of capacity planning process is that it requires the maximum utilization capacity of the system to be estimated. There are three common ways of doing this: guessing, load testing, and modeling. Let's discuss each.

The first approach, and by far the most widely used, is to guess what the maximum utilization capacity of your service is. This is known as the back-of-the-envelope method, or even the seat-of-your-pants method. It's not scientific and it's not based on data, only intuition. It has many problems. It's almost certainly not going to be correct. You can't systematically improve the process behind it, so it never gets any better. It's not scalable and you can't automate it. This means that you're less likely to carry out capacity planning, since it's a manual process.

The second approach to determining utilization capacity is through load testing. This is much, much better than guessing, but this approach also has several downsides. If you're running a large system, setting up an appropriately sized test bed can be prohibitively expensive, if possible at all. Additionally, you'll need to create a large cluster of clients to generate the workload, which adds to the cost and complexity. The second problem is that it's easy to overlook failure modes which may have a major impact in the real world, for example pathological clients generating unexpected workloads, possibly due to abuse or being incorrectly configured. Building a test suite to generate an appropriate workload can be very challenging too.

A third approach is using modeling. This approach involves building executable performance models, possibly in a spreadsheet, Jupyter notebooks, or a regular programming language. The model attempts to capture the performance and scalability characteristics of the system. This approach can be very complex. Additionally, it can be very difficult to validate whether the model is predicting the maximum utilization correctly.
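As a rough illustration of what such an executable model might look like, here's a deliberately simplified, hypothetical sketch, not something we actually use, that estimates the maximum request rate a service could sustain from a measured per-request CPU cost. Real systems have many more bottlenecks than this, which is exactly why these models are hard to validate.

```python
def estimated_max_rps(cores: int, cpu_seconds_per_request: float,
                      target_utilization: float = 0.7) -> float:
    """Requests per second the fleet can absorb before CPU saturates,
    derated to a target utilization to leave some headroom."""
    return cores * target_utilization / cpu_seconds_per_request

# Example (hypothetical numbers): 64 cores, 25ms of CPU per request,
# aiming to stay below 70% CPU utilization.
print(f"~{estimated_max_rps(64, 0.025):.0f} requests/second")
```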
So, estimating capacity is really hard. To make matters worse, the application is evolving very quickly, as is the production environment. In a monolithic application such as GitLab, multiple teams may be making changes to the environment and the application simultaneously, making it even more difficult to control for change. Additionally, experiments are running, feature flags are being toggled, and user activity patterns are constantly changing. Any maximum capacity estimates that are produced quickly go out of date and lose their meaning, constantly lagging behind the true state of the environment. This means that the capacity estimates need to be constantly revised through frequent updates. Unfortunately, none of the methods for capacity estimation are easy to perform, and efficiently automating them is very hard, or not possible.

The approach that we use on GitLab.com is closest to modeling. However, instead of trying to estimate the capacity of the system as a whole and expressing that using a single demand indicator such as active users, we treat each service as a collection of related resources and monitor and track them independently. So the resources for each service are treated as a collection of independent but grouped capacity planning metrics. All of these metrics have different growth rates, thresholds, and periodicity. Each new change introduced to the system, be it through application changes or infrastructure changes or even user activity, will impact the resources differently. We do not attempt to aggregate these resource capacities back into a single system capacity, as this is error-prone and difficult to verify. Unlike when attempting to estimate whole-system capacity, estimating the capacity of individual resources is fairly straightforward. For example, for disk space, the capacity is the disk size. For open file descriptors, the capacity is the maximum number of file descriptors. And for database connection pools, the capacity is well defined in configuration. The biggest downside to this approach is that we have to track and forecast a lot more metrics, but many services share the same types of resources. For example, many stateful services will share disk space and inode resource types.
Many Ruby-based services will share a resource metric representing global interpreter lock utilization. We can therefore think of resources and services as being in a combination matrix, something like what's shown on this slide. The downside to this approach is that instead of having a single utilization that we have to monitor and forecast for each service, we have between 10 and 20 per service. This generates a lot more data that we need to manage and perform capacity planning for, but luckily this is something that we can automate.

To further explain, let's dive straight into the implementation, as I think this will help illustrate how this approach works. In order to build a system that can monitor many different types of resource utilization across different services, we first need to normalize and standardize the resource utilization metrics that we use. We do this by constructing recording rules in Prometheus. All the resource utilization recording rules use the same name, gitlab_resource_utilization:ratio, but they have different labels to distinguish them: one label for the name of the service and another for the name of the resource. Each combination of service and resource will have one series. The value of these recording rules is between zero and one, with zero being 0% utilization and one being 100% utilization, in other words, at capacity. These recording rules are fairly easy to construct. For any given resource, they are simply the current utilization divided by the current capacity. Here are three examples from the GitLab code base. For GCP quota resources, we obtain current quota utilization and current limits from a GCP quota exporter for Prometheus. For open file descriptor metrics, many Prometheus client libraries automatically publish these metrics out of the box, so they're very easy to obtain. A third example demonstrates how persistent volume claim disk usage can be monitored using metrics exported by the kubelet.

This approach results in a lot of utilization metrics being generated. Taking the open file descriptor utilization metric on GitLab.com as an example, there are close to 5,000 series for it at present. This is too much data. We need to reduce it down by aggregation so that we can have a single representative metric per service-resource pair. Now, there are many different ways that we can do this. For example, we could use an average or a quantile, but in most cases we find that the best way of aggregating these metrics is by the maximum value. This isn't always the case. For example, for some Kubernetes metrics, where we might have very high cardinality, we might use the 99th percentile, removing outlier values. However, for the majority of resources we aggregate using max. The reason we use this is that the highest utilization is the one that leads to saturation and instability in the system. For example, if you have four servers in a cluster and three of them have very low disk utilization, but one of the servers has a disk that's almost full, then that's the signal that we want to aggregate. The average disk space utilization across the servers is below 50%, which seems fine, but one of the servers may be close to failure, and by using max, we pick up the worst case for alerting, monitoring, and forecasting. Taking the same example of open file descriptors as we used previously and aggregating on the max across servers significantly reduces the volume of data, giving us a single signal per service.
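To sketch the shape of these rules, here's a small Python snippet that emits recording rules of this form. The source metrics (standard Prometheus client and kubelet metrics) are real, but the rule and label names here are simplified stand-ins for illustration rather than our exact definitions; the service label is assumed to already be present on the source series. Each rule is simply utilization divided by capacity, and a second rule aggregates to the worst case with max.

```python
import yaml  # pip install pyyaml

# Per-resource utilization expressions: each is "current utilization / current capacity",
# producing a value between 0 (idle) and 1 (at capacity).
resource_expressions = {
    "open_fds": "process_open_fds / process_max_fds",
    "pvc_disk_space": "kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes",
}

rules = []
for resource_name, expr in resource_expressions.items():
    # One series per service/resource combination, all sharing the same rule name.
    rules.append({
        "record": "gitlab_resource_utilization:ratio",
        "labels": {"resource": resource_name},
        "expr": expr,
    })

# Aggregate to a single worst-case signal per service and resource using max.
rules.append({
    "record": "gitlab_resource_utilization:ratio:max",
    "expr": "max by (service, resource) (gitlab_resource_utilization:ratio)",
})

print(yaml.safe_dump({"groups": [{"name": "resource-utilization", "rules": rules}]},
                     sort_keys=False))
```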
If any service has a process nearing saturation for its open file descriptors resource, we'll have a clear signal of it, which we can visualize and alert on. Each of these resource and service combinations has a set of recording rules, alert threshold values, alerts, and Grafana dashboards associated with it. Early on, this configuration was hand-coded, but this soon became unmanageable with too many combinations of resources and services. To get around this, we migrated to a configuration language called Jsonnet. This allows us to manage a single configuration for each resource, including metadata such as documentation, or whether or not the resource is horizontally scalable. This slide shows an abbreviated example of that configuration. We can then validate the configuration before using it to generate all the recording rules and alerts. For dashboard generation, we use a Jsonnet library maintained by Grafana called Grafonnet. This allows us to generate all our dashboards from the same configuration and ensure that they're always up to date. In fact, most of the dashboards for GitLab.com are now generated from a Grafonnet definition.

This slide shows the main utilization dashboard for a service, in this case our primary Postgres cluster. Each of these series represents the worst case for a given type of resource. This allows us to quickly review resource utilization across dozens or even hundreds of resources that are being consumed by the service. If one of the metrics requires further investigation, we can navigate to detailed dashboards that are generated for the resource. This is what some of those panels look like. For each resource, we generate a full Grafana panel. These six resources belong to our main Postgres instance once again. The aggregated recording rule value is represented by the thick yellow line, and the disaggregated underlying values are visualized alongside it too. This allows us to quickly dig down into the source of a utilization or saturation issue and determine the problem. Of course, we generate alerts for each utilization metric too. Generally, the alerting threshold for each resource is somewhere below 100%. The actual alerting threshold depends on the nature of the saturation metric and the way in which it's measured, in particular the time period over which samples are collected. These thresholds are used for real-time monitoring alerts but also for forecasting capacity issues, as we'll see later.

Now that we're collecting utilization data for resources in a normalized form, we can start analyzing it over long periods of time, looking for trends. Luckily, GitLab.com uses Thanos for long-term metric storage. This means that all the resource utilization data that we use for short-term monitoring of GitLab.com is retained for several years in object storage. Here's an example of resource utilization data over a long period showing changes in trend and steady growth towards saturation. Reviewing this data made us confident that we could potentially be using it for forecasting. We started looking at how we could leverage this data for planning purposes by forecasting future potential saturation. Our first attempt at forecasting was not a huge success. We attempted to use linear regression to predict future growth. Prometheus has a predict_linear function that we could employ, so our first, somewhat naive, attempt at building a forecasting engine was built on this function. Since the data could be processed within Thanos, this approach was very easy to do.
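To give a sense of how little work that first attempt involved, here's roughly what it looks like to ask Prometheus or Thanos for a linear extrapolation over HTTP. This is a sketch rather than our actual forecasting code, and the endpoint URL, metric name, and labels are placeholders.

```python
import requests

THANOS_URL = "http://thanos-query.example.com:9090"  # placeholder endpoint

# predict_linear() fits a simple linear regression over the last 7 days of the
# utilization series and extrapolates it 30 days into the future.
query = (
    'predict_linear('
    'gitlab_resource_utilization:ratio{service="patroni", resource="disk_space"}[7d], '
    '30 * 86400)'
)

resp = requests.get(f"{THANOS_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    predicted = float(result["value"][1])
    print(result["metric"], f"predicted utilization in 30 days: {predicted:.2%}")
```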
It was very easy to implement and to experiment with. We could see promise in the approach, but unfortunately the forecasts were very inaccurate. One of the main reasons for this is that linear regression fails to take seasonality into account. In this example, the linear regression is able to determine the single long-term growth trend in the utilization data but fails to take into account weekday peaks, averaging the trend with weekend off-peak times. Any changes in trend, possibly due to application or environment changes, are also averaged out into a single growth trend based on the entire historical data set. We knew that we would have to go back to the drawing board.

Luckily, around this time we became aware of an open-source project out of Meta called Prophet. It's a library written in R and Python for performing forecasting and predictions. It seemed like a much better fit for our purposes than linear regression could ever be, and we started experimenting with a proof of concept. It quickly became clear that it was well matched to the task at hand. Here are some of the reasons we really like Prophet. Forecasting is notoriously difficult to do. It generally requires specialist skills. Prophet has been designed to make forecasting easier, without the need for data science expertise. Prophet works on time series data and can produce very good forecasts without needing specific tuning and customization on a per-series basis. This allows us to scale it up to hundreds of forecasts. It's very fast at generating forecasts too, which also helps when you have as many as we do. It's able to recognize seasonal patterns, for example traffic activity over hours in a day or days in a week. Our utilization data is strongly seasonal. We see the same traffic patterns week in and week out. Prophet is also able to handle outliers and missing data really well. Finally, it can detect changes in trends and adjust forecasts accordingly.

We jokingly called the proof of concept Tamland, after Brick Tamland, the kind but simple-minded weatherman from the Anchorman movies played by the actor Steve Carell. We selected the Python version of the Prophet library because we have more expertise within the team in Python than in R. Prophet is well suited for use with Jupyter notebooks, but we wanted to automate the process of generating the reports, so we used a Python tool called Jupyter Book, which is designed to generate static websites from Jupyter notebooks or from Markdown documents. The utilization data is combined with metadata such as service catalog information and resource metadata to augment the forecast with useful context and descriptions. In a GitLab CI pipeline, we import the utilization data from Thanos, run forecasts, and generate a static site including Plotly graphs showing trends and forecasts over time. We then publish this site using GitLab Pages. This pipeline is run automatically on a weekly basis. This is an example of what the weekly report looks like. It's a fairly standard static website. On the left we have links to the various services running GitLab.com, and then, clicking through to a service, we're presented with the capacity planning forecast for that service for each monitored resource.
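The forecasting step itself is fairly compact. Here's a minimal sketch of fitting Prophet to a single utilization series; the input file name and the 85% threshold are placeholders rather than our real configuration, and the real pipeline reads the series straight from Thanos.

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

# One utilization series, with a 'ds' column of daily timestamps and a
# 'y' column of utilization ratios between 0 and 1 (placeholder input file).
df = pd.read_csv("patroni_disk_space_utilization.csv", parse_dates=["ds"])

# interval_width=0.8 gives the 80% confidence band plotted around the median forecast.
model = Prophet(interval_width=0.8)
model.fit(df)

# Forecast 90 days beyond the end of the historical data.
future = model.make_future_dataframe(periods=90, freq="D")
forecast = model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]

# Predict when the median forecast first crosses an alerting threshold, if it does.
threshold = 0.85
violations = forecast[(forecast["ds"] > df["ds"].max()) & (forecast["yhat"] >= threshold)]
if violations.empty:
    print("no threshold violation predicted in the next 90 days")
else:
    print("threshold predicted to be violated on", violations["ds"].iloc[0].date())
```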
Here's an example of a resource from a service in the report. At the top, we have a description of the resource being monitored. This is important because in some cases, such as for inodes, it's very easy to understand the utilization metric, but in other cases some of the resources that we measure can be pretty abstract, and having a clear explanation of what's being tracked can really help. Next, we have some deep links to Grafana dashboards so that the operator can quickly navigate to access more information. After that, we present some dates for when the resource is predicted to violate its alert thresholds and also the 100% threshold. This is used for prioritization and alerting. Finally, we have the most important component: the forecast chart. This is a time series chart plotted over a 270-day period, that's six months into the past and three months into the future. Because of that, the current date is always two-thirds of the way across the time series. For the forecast, we present the median forecast line and an 80% confidence interval range around it. Depending on the variance in the data, this confidence band will be wider or narrower.

Let's take a look at some examples. This chart measures our total Git disk utilization across all 80 or so of our Git servers. Recording this as a resource metric gives us a pretty good idea of when we'll need to provision additional Git servers. The provisioning of these servers doesn't take particularly long, but there can be a lead time in getting quota limits raised for storage, so it's important that we're planning ahead for these events. Here's an example of what Prophet calls a change point. A change point is an inflection point where a trend changes. In this case, we see a sudden step up in resource consumption. Because we don't have much headroom on the service, and the lead time to mitigating this saturation is long, the decision was made to investigate what had led to this change. A regression in the application was discovered and corrected, avoiding a saturation problem further down the line.

Here's one more example. This resource represents Redis memory on our Redis session storage cluster. The cluster stores session state. If we hit saturation, Redis will start to evict old session state, with the least recently accessed being evicted first. If that happens, you may need to log back into the GitLab web application sooner than you otherwise would have needed to. This was caused by a bug in the GitLab application, coupled with some unusual user activity. The issue on the right was created around about the time of the event and demonstrates how the team was able to pick this issue up within a short period of time from the start of the event, and long before the issue became critical. Unfortunately, in this case the forecast wasn't accurate, in that it didn't predict saturation because it had not yet picked up the change point, but it was picked up via manual review with plenty of time to mitigate the problem.

Here are some figures from our current deployment of Tamland. We monitor about 360 different service-resource combinations, but this is growing all the time. Coincidentally, we've investigated about the same number of potential issues within the last 12 months. The Tamland report generation job takes around 90 minutes to run on a dedicated n1-standard-8 runner. This job does a lot of caching to speed up retrieval of historical data from Thanos. Tamland is an active project, and we're improving it all the time to make it better and more useful for forecasting for engineers at GitLab. Here are some of the improvements that we're considering implementing in future.
Tamland started off as an internal project, and it was highly coupled to our specific infrastructure and monitoring systems. We'd like to continue to evolve the project in a way that would allow other people to benefit from it. This would require decoupling it and allowing more configuration options. Further improving the accuracy of the predictions is going to be a perpetual goal for the project. One of the ways that we might consider doing this is by allowing customization of forecasting and alerting parameters for different resource types. Another goal is better ownership: shifting ownership left to product teams. We can do this by sending alerts directly to those teams. Finally, it would be worthwhile evaluating whether other forecasting libraries, including NeuralProphet, which is a PyTorch neural-network forecasting library based on Prophet, or LinkedIn's Greykite library, could work better.

In summary, we use Tamland to generate a weekly capacity planning report. This capacity plan is tightly coupled with our engineering scheduling processes. The same metrics that we use for short-term resource utilization monitoring and alerting also get used for long-term forecasts, thanks to our ability to retrieve them from Thanos. Prophet is a very useful tool which makes forecasting very easy, and I would highly recommend you give it a try for forecasting. If you're interested in exploring further, here are some links to the Tamland source code. Unfortunately, the report itself is no longer public due to restrictions on publishing forward-looking statements now that GitLab is a public company, but all the source code is still available. And of course, if you're interested in this sort of thing, please reach out. We have a large number of vacancies open in the infrastructure team. And that's the talk. Thank you very much. I think we're now going to have some questions.