Welcome. This session is How We're Dealing with Metrics at Scale on GitLab.com. My name is Andrew Newdigate and I'm an engineer at GitLab, where I work on the infrastructure team and help build GitLab.com. This talk is about how we've scaled our monitoring to support a site that has, over the past few years, grown rapidly in size and complexity.

To illustrate that growth, here are some figures to show how things have changed since I joined. Back in 2017, we'd only recently adopted Prometheus and were migrating off InfluxDB. We had a single Prometheus server, six infrastructure engineers, and a handful of alerting rules and recording rules. We had only 21 dashboards and we were processing about 100,000 samples per second. Roll forward a number of years and we now run a Thanos federated cluster deployed into Kubernetes using Tanka. The infrastructure department is around 40 people, so six times bigger. We have over 2,600 recording rules and about 400 Grafana dashboards, and we're ingesting about 2.8 million samples per second. It's important to state this: what we did back then worked fine at that scale. It was the right solution at the time, but that approach wouldn't work for us now, and this talk is about some of the tools and techniques that we've used to go from that scale to where we are today.

So what prompted our efforts to improve our alerting? We were seeing numerous problems that indicated that our approach to monitoring was no longer working for us. One of these problems was low-precision alerting. By this we mean that the proportion of alerts that was actionable was low and we were seeing a high number of false positives. Many of our alerting rules inadvertently generated low-quality, unactionable, flappy alerts. Very often the engineer on call would determine that users were not being impacted and that everything seemed okay, and they would acknowledge the alert and effectively ignore it.

Not only was the precision of our alerts very poor, but so was the recall. Recall refers to the proportion of user-impacting events that are detected by the alerting system. This means that instead of finding out about incidents through an alert, we would sometimes be made aware of an incident by people rather than by the software that we'd built to detect these incidents. In other cases the alert would fire, but too late: we already knew that there was a problem and now it was just extra noise while we were trying to solve the issue.

Yet another problem we found was that our dashboards were very often broken and not working as we expected. Since our dashboards were not managed alongside our other metrics, there was no way of validating that they were still working until we took a look at them, and this often happened during an incident. So now, instead of having one problem, we had two, in that we had to fix the dashboard before we could fix the problem.

One of the things that we began to realise was that having three distinct configurations for our metrics stack was part of the problem. The source of metrics was independent from our alerting and recording rules, our alerting and recording rules were managed independently from our dashboards, and our dashboards weren't stored in Git: they had no form of change control and they weren't validated. With this in mind we set out to improve our stack with these goals. One, develop a common monitoring strategy across all of our services based on a set of key metrics and service level indicators.
Two, use those metrics to improve the precision, recall and detection time of our alerts. And three, unify our metrics, SLO and alerting configuration, recording rules, dashboards, everything, into a single source, to avoid inconsistencies between the definitions.

Let's look at how we tackled the first goal of building a set of key metrics for our application. We based our metrics on Google's four golden signals but made some changes to suit our requirements. For latency we measure Apdex as a ratio, rather than percentile duration measurements in seconds. Requests and errors are measured as per-second rates, and saturation is measured as a percentage, lower being better. Saturation is a pretty big topic on its own, so I'm not going to go into detail on it today. If you're interested in finding out more, here's a plug for a talk I did on saturation monitoring on GitLab.com; I've included a link to the slides.

With our key metrics decided on, the next step was to break the application down, first into a set of services, and then break each service down into a set of components. So for example we modelled web, Git and API services, and then broke these down further into one or more components. For the Git service, for example, we have SSH and HTTPS components. Each component has three key metrics: Apdex, errors and requests. For some components it's not always possible to measure latency directly, so Apdex is optional in those cases.

From our three key metrics we're able to derive two service level indicators, or SLIs. An SLI is normally expressed as a percentage of requests that are bad, and Apdex is really the inverse of that: it's the percentage of requests that have a satisfactory, or good, latency. Because our organization was already using the concept of Apdex, we decided it would be better to adapt our monitoring system to the organization rather than the other way around. Therefore our Apdex SLI is an inverted SLI, with 100% being the best service level. For errors we use a conventional SLI definition, with 0% meaning no errors, the best service level.

Once we had our approach to monitoring our key metrics in place, it was time to start thinking about our second goal: improving the quality of our alerting. As I mentioned, for each component we derive two SLIs, Apdex and errors. For each of these we set a service level objective and trigger an alert if an SLI is violating its SLO target: if our Apdex is below the SLO threshold, or our error ratio is above the SLO, we trigger an alert. For some services we also trigger anomaly alerts for high request-rate anomalies.

The original approach we took was to alert on any violation over a 5-minute period. For example, if you have a thousand requests in a 5-minute period and two of those requests result in an error, two in a thousand gives you a 0.2% error rate, and if you have a 99.9% SLO, this 0.2% exceeds the 0.1% threshold, causing the alert to fire. Unfortunately this is a very naive approach and it has very poor precision, in that it generates a huge number of false positives. Taken to the extreme, the alert could fire hundreds of times a day, yet the SLI could still achieve its SLO. In fact our new alerting was no better than the old alerts that we were trying to improve on. We went back to the drawing board and looked for better alternatives. We settled on using multi-window, multi-burn-rate alerts instead.
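To make the naive approach and its arithmetic concrete, here is a minimal sketch of it as a Prometheus recording rule and alert. The metric names, label names and the 99.9% SLO are illustrative assumptions rather than our actual configuration:

```yaml
groups:
  - name: slo-alerts-naive-sketch
    rules:
      # Error ratio SLI over a 5 minute window:
      # errors per second divided by total requests per second.
      - record: component:error_ratio:rate_5m
        expr: >
          sum by (component) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (component) (rate(http_requests_total[5m]))
      # Naive alert: fire whenever the 5 minute error ratio exceeds the
      # 0.1% error budget implied by a 99.9% SLO. Two errors in a
      # thousand requests is 0.2%, so this fires, even though the SLI
      # may still comfortably meet its SLO over a longer period.
      - alert: ErrorSLOViolationNaive
        expr: component:error_ratio:rate_5m > 0.001
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error ratio for {{ $labels.component }} is above the 0.1% SLO threshold"
```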
I'm not going to go into the details in this talk, but if you're interested in knowing more, Google have published an excellent guide in their SRE workbook; I've included a link on this slide. This approach has provided us with high precision, low detection time and good recall on our alerts. The problem with this approach is the amount of complexity it brings. For each component that we monitor we need 12 recording rules to be correctly configured. With dozens of components you really need a configuration tool to help with this, as doing it manually would be very painful.

So before we could roll out our SLO alerting with multi-window, multi-burn-rate alerts, we needed to investigate better tooling to deal with all the repetitive configuration that was required. For each service we may have several dozen similar but slightly different recording rules. In our dashboards we might have other queries that are also similar but use different aggregations. Changing these queries was an error-prone process. So we started thinking about what tools we could use to make this process easier.

The idea we had was to describe all of our metrics in what we call the metrics catalog. This is an abstract configuration: it's written in Jsonnet and it's designed to be user-friendly, validatable, and with as little repetition as possible. The configuration is stored in Git and changes are managed through merge requests. On commit we use CI to validate the config and generate new Prometheus recording rules, alert configurations and Grafana dashboards, amongst other resources.

This is what a typical entry in the catalog looks like. This definition is from the web service and shows one component from that service, called Workhorse. We define our SLOs as well as the Apdex, request rate and error rate definitions that will be used to generate the SLIs. These definitions are then used to generate Prometheus expressions in our Prometheus recording rule configuration, as well as dashboards and everything else. As you can see, this generates lots of very similar but slightly different Prometheus configuration depending on the burn rates that you're evaluating (a rough sketch of what these generated rules can look like follows below).

The last part of our goal was to generate our dashboards too. Now, as it happens, the Grafana team have built a Jsonnet library called Grafonnet. We could use it to automatically generate our Grafana dashboards from the metrics catalog. This is a typical example of one of our generated dashboards; this is from our web service dashboard, and what I really like about these dashboards is the consistency. We have dozens of different services, and for each service the dashboard layout, the color scheme and the data presented are consistent. The top row of each dashboard provides an aggregation of all the SLIs within that service, and this is followed by a row for each component of the service, with charts of the key metrics and collapsed rows containing even more detail.

Once we had our SLO monitoring in place, the next challenge we faced was making SLO alerts easier for operators and engineers to understand, and in particular reducing the time to diagnosis on our SLO alerts. One of the big differences between the old way that we alerted, with cause-based alerts, and our SLO alerts, is that when an SLO alert fires it's not always immediately apparent what the problem is. It's up to the operator to understand the SLI and then investigate the problem by digging through metrics, logs and other signals until the cause becomes apparent.
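Before moving on, here's a rough illustration of the kind of repetitive generated configuration just described, following the multi-window, multi-burn-rate pattern from the SRE workbook. The rule names, the 99.9% SLO and the burn-rate factors are illustrative assumptions; the real generated rules cover more windows and both SLIs, which is how the count reaches a dozen rules per component:

```yaml
groups:
  - name: slo-alerts-burn-rate-sketch
    rules:
      # The same error ratio SLI, recorded over several window sizes.
      # Repeat for 5m, 30m, 1h and 6h (and again for the Apdex SLI),
      # which is how the per-component rule count adds up.
      - record: component:error_ratio:rate_1h
        expr: >
          sum by (component) (rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum by (component) (rate(http_requests_total[1h]))
      # Fast burn: 14.4x the 0.1% error budget over 1h, confirmed by the
      # short 5m window so the alert resolves quickly once the problem clears.
      - alert: ErrorSLOBurnRateCritical
        expr: >
          component:error_ratio:rate_1h > (14.4 * 0.001)
          and
          component:error_ratio:rate_5m > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.component }} is burning its error budget too fast"
      # Slow burn: 6x the error budget over 6h, confirmed by a 30m window.
      - alert: ErrorSLOBurnRateElevated
        expr: >
          component:error_ratio:rate_6h > (6 * 0.001)
          and
          component:error_ratio:rate_30m > (6 * 0.001)
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.component }} is burning its error budget faster than expected"
```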
So our goal here is to give the operator the tools to navigate from an SLO violation signal back up the stack to the cause of the problem. Here's an example to illustrate why our existing tooling was insufficient for our needs. Alerts sent through Alertmanager include a field called the generator URL: a link to the Prometheus UI pre-populated with the query that caused the alert to fire. Before we moved over to SLO violation alerts we relied pretty heavily on this feature. Each alert would include a link to the expression that caused it. In the Prometheus UI we would manipulate the expression by adding labels, changing selectors or changing aggregations until we could spot the problem.

What we found with SLO alerts is that this approach doesn't work very well. The problem is that the recording rules that we use in the expression are highly aggregated, and it's likely that the labels which may have been useful in an investigation have been removed. Unfortunately there's no quick way to navigate back to the source expression from a recording rule. Arriving at a chart of an SLO burn-rate expression like this often led to more confusion for engineers instead of clarity. We needed to create a better initial experience for the operator following an alert.

The way we addressed this was to take advantage of the metadata present in the metrics catalog. Since we're generating the SLO alerts and the dashboards from the same source, we can include deep links to the appropriate Grafana panel, and these can be embedded directly in the generated alert definition. By navigating to the dashboard from the alert, the operator is immediately provided with context around the alert: the signal, the thresholds, the status of other SLIs in the same service, and links for onward investigation.

In our generated dashboards, for each component we include a set of links to the other observability tools that we use, to assist in deeper investigation into problems with that component. Like everything else, these links are described in the metrics catalog. They include links to Stackdriver, Sentry, and Kibana searches and visualizations, amongst other things. They're presented directly alongside each component in the Grafana dashboard. So when arriving at a dashboard from an alert, the on-call engineer gets an easy path to continue their investigation through our other observability tools.

The final part of this talk describes the challenges we've experienced scaling SLO monitoring from a single Prometheus server up to the Thanos federated cluster that we use today. Let's start with the simplest approach to SLO monitoring, and that is using a single Prometheus instance to monitor the entire application. With this approach, all of the work to collect metrics, aggregate them into SLIs and evaluate those SLIs against service level objectives is done in a single Prometheus server. This approach is very straightforward and easy to operate, but it's limited by how far we are able to vertically scale a single Prometheus instance. Once we hit that scaling limit, the next logical step is to break the monitoring down into multiple siloed Prometheus instances. With this approach, the data for each SLI is fully processed within a single Prometheus instance, so it can continue to collect, aggregate and evaluate SLIs in a similar manner to before, except that each Prometheus only contains a distinct subset of the SLIs. In Grafana we use multiple data sources to visualize data across the different instances.
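Coming back to the alert deep links for a moment, here is a minimal sketch of how a generated alert can carry one in its annotations. The dashboard URL, panel ID and annotation name are hypothetical, not our actual generated definitions:

```yaml
# Hypothetical generated alert that links straight to the relevant
# Grafana panel, so the on-call engineer lands on the service dashboard
# rather than on a bare, highly aggregated recording-rule expression.
- alert: WebServiceErrorSLOViolation
  expr: component:error_ratio:rate_1h{type="web"} > (14.4 * 0.001)
  labels:
    severity: critical
  annotations:
    summary: "Error SLO violation on the web service"
    grafana_dashboard_link: >-
      https://dashboards.example.com/d/web-main/web-overview?viewPanel=12&var-environment={{ $labels.environment }}
```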
The advantage of this approach is that it remains fairly simple to both deploy and understand, while allowing us to scale Prometheus horizontally. One limitation of this approach, though, is that all the metrics required to evaluate an SLI must be contained within a single Prometheus instance, and unfortunately this requirement became problematic for us. This happened when our Kubernetes migration project kicked off. As workloads migrated to Kubernetes, some SLIs were split between the Prometheus instances used for VMs and the new Prometheus instances contained within our Kubernetes clusters. This was made worse by the fact that we decided to deploy three zonal Kubernetes clusters, each with their own Prometheus instance. So instead of metrics being collected in a single Prometheus instance, some of our SLIs were now being split between up to four different instances. The problem with this is that there may be local SLO violations even when, aggregated across the entire application, the service level objective is not being violated. This led to a series of low-precision alerts, which we nicknamed split-brain alerts because they were only applicable to a single Prometheus instance, not the entire cluster. A second problem with having SLIs split between multiple Prometheus instances is that it becomes difficult to get a global view of an SLI, since we need to combine data from multiple sources in our visualizations.

The solution we used to address this problem was to deploy Thanos. Thanos is a CNCF incubating project; I'm sure many of you know of it. Thanos provides a single global view across multiple Prometheus instances. It also has a component called Thanos Rule which can be used for evaluating recording rules against that single view. This provides us with a mechanism to aggregate across multiple Prometheus instances. Thanos Rule will also evaluate alerts using the same approach as Prometheus, except that, once again, it evaluates them against the single global view.

To use Thanos Rule we broke our SLO recording rules into two parts. Most of the metrics processing remains in Prometheus: here we convert potentially high-cardinality application metrics into low-cardinality key metric constituents. Then in Thanos Rule we sum the key metric constituents across all instances before calculating global Apdex and error rate SLIs. These are evaluated against SLOs to provide alerting on globally aggregated values, doing away with the problem of split-brain alerts.

This example configuration shows how we aggregate metrics from multiple Prometheus instances in Thanos Rule using recording rules. For each of our key metrics we aggregate all our Prometheus metrics, whilst being careful to exclude any previously evaluated Thanos metrics. The first recording rule aggregates the error rate, the second the operation rate, and the third recording rule uses the two previous values to create a global error ratio SLI. Note that we use monitor="global" as a Thanos selector to control whether to include or exclude globally aggregated metrics in these expressions. Another important point is that the partial response strategy is set to warn instead of the default, which is abort. The reason for this is that, with partial response set to warn, if a single Prometheus store is unavailable the aggregation won't fail. Instead, the metrics from that Prometheus instance will temporarily not be included in the aggregation, but this is a better trade-off than losing all the metrics.
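Here's a simplified sketch of what those Thanos Rule recording rules can look like. The rule names and label conventions are illustrative; the real definitions are generated from the metrics catalog:

```yaml
groups:
  # Evaluated by Thanos Rule against the global view of every Prometheus
  # instance. With partial_response_strategy set to warn, an unavailable
  # store only drops that instance from the aggregation instead of
  # failing the whole rule evaluation.
  - name: global-error-ratio-sketch
    partial_response_strategy: warn
    rules:
      # 1. Globally aggregated error rate: sum the per-Prometheus
      #    constituents, excluding any previously aggregated global series.
      - record: component:errors:rate_1h
        labels:
          monitor: global
        expr: >
          sum by (type, component) (component:errors:rate_1h{monitor!="global"})
      # 2. Globally aggregated operation (request) rate.
      - record: component:ops:rate_1h
        labels:
          monitor: global
        expr: >
          sum by (type, component) (component:ops:rate_1h{monitor!="global"})
      # 3. Global error ratio SLI, derived from the two aggregates above
      #    and evaluated against the SLO for globally aggregated alerting.
      - record: component:error_ratio:rate_1h
        labels:
          monitor: global
        expr: >
          component:errors:rate_1h{monitor="global"}
          /
          component:ops:rate_1h{monitor="global"}
```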
We work around this by monitoring for partial response warnings in our monitoring stack. In conclusion, here are some of the ways we've learned to deal with metrics at scale. Firstly, we define key metrics for each service component. We manage complexity and repetition by using an abstract definition in the metrics catalog as our single source of truth. We migrated to multi-window, multi-burn-rate SLI alerts for improved alerting. We generate our dashboards to ensure that they're kept up to date and validated. We focus on improving the on-call engineer's experience, because SLI alerts are not always intuitive. And finally, we federated our service level monitoring using Thanos and Thanos Rule. One last point: if you're interested in learning more, I highly recommend reading these fantastic resources on monitoring in general and SLO monitoring in particular. Finally, all the code for our metrics catalog is available on gitlab.com in our runbooks project; I've included a link here. Thank you very much.