All right, let's get started. Welcome to the Prometheus team functional group update. My name is Ben Koche, I'm the Prometheus team lead, and we have Joshua, Jose, Mike, and Paul on the team. It's been a while since we did an update because of the summit, so this covers 10.1 and 10.2. In 10.1 we spent most of our time on front-end display issue cleanup and got through a bunch of nice bug fixes. In 10.2 we started working on InfluxDB feature parity. We have some of that parity in, but we've had a few problems, which we'll talk about next. Some of the things InfluxDB did don't translate directly, and we had to disable a few things that were a bit excessive as Prometheus metrics. We've also been having a number of issues with the Ruby gem that generates the metrics. That work is still in progress but close to being done, and I'll hand over to Paul to talk about it.

Okay, thanks, Ben. Can you switch the slides for me, please? Okay, this slide is fine. To give a quick rundown: we run our Ruby processes, like GitLab, in a multi-process setup using Unicorn. That means on each server we have multiple processes, each handling requests. We want to gather metrics from all of those processes and expose them so the Prometheus server can scrape them. To do that, we went with a scheme where each process saves its metrics in a file on the file system, preferably a memory-backed file system like tmpfs so it's fast. The process that handles the scrape request then reads all of those files and sends the combined reply to the client that requested it. Okay, next slide, please.

So we had some problems with that, mostly metric corruption, and there were a few reasons why the corruption happened. One of them: whenever the Ruby code adds another metric, or simply adds another measurement, it needs to add a line to the file, and sometimes the file has to be enlarged. We start at around 4 kilobytes and grow to 15 or 16 kilobytes. Mostly that worked, but sometimes it didn't. It turned out the mmap gem that handled this had a bug where it didn't update its internal information about the file size, and that got triggered. Why it only triggered some of the time I'm not sure, because I haven't traced it, but adding one line that updates the stored size seems to fix it. So that's working correctly now. Okay, next slide, please. Are there any questions about any of this? I'm happy to answer.

So the other problem we had, I think there was a bug in this slide, but anyway: one of the first problems with the gem was that we created tons of metric files on staging, like 30,000 or 40,000 of them, and then we needed to process all of those files. That was because each file was created for a unique PID: a worker process was spawned, it had a PID, it created a metric file, then the process died, it was respawned by the Unicorn master, and that continued until we had thousands of files. To combat this, we switched from the PID to the worker ID. The worker ID is internal information from Unicorn about which slot the worker occupies, so with eight unique workers, at any one point each process has one ID. This was all fine, but the way we got those IDs out of Unicorn was buggy.
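Before getting into that bug, here is a minimal sketch of the overall scheme Paul has been describing: one metrics file per worker on a memory-backed directory, merged when the scrape request comes in. It assumes a JSON-lines layout for readability; the real prometheus-client-mmap gem uses a binary mmap'ed format, and the paths and helper names here are purely illustrative.

```ruby
# Illustrative sketch only: the real gem uses a binary mmap'ed format,
# and METRICS_DIR, record_metric and collect_metrics are made-up names.
require 'json'

METRICS_DIR = '/dev/shm/gitlab_metrics' # memory-backed filesystem (tmpfs-style)

# Each Unicorn worker appends to its own file, keyed by its worker id.
def record_metric(worker_id, name, labels, value)
  path = File.join(METRICS_DIR, "counter_#{worker_id}.json")
  File.open(path, 'a') do |f|
    f.flock(File::LOCK_EX) # guard against concurrent writers
    f.puts({ name: name, labels: labels, value: value }.to_json)
  end
end

# The scrape handler reads every worker's file and sums matching series
# before replying to the Prometheus server.
def collect_metrics
  totals = Hash.new(0.0)
  Dir.glob(File.join(METRICS_DIR, 'counter_*.json')).each do |path|
    File.foreach(path) do |line|
      entry = JSON.parse(line)
      key = [entry['name'], entry['labels']]
      totals[key] += entry['value']
    end
  end
  totals
end
```

The important property is that workers only ever append to their own file, and all cross-process merging happens once, at scrape time.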
So we got buggy worker IDs, and the worst clashes were that sometimes a process would start writing to another process's file because it thought it had ID zero while another process also thought it had ID zero. So now we lock the files using flock, which works fine. The remaining problem was that sometimes that lock got inherited: the lock was created in the master process and then inherited by the worker process, which is on another slide. I think most of the problems we encountered were due to preload_app, because it changes how initialization is done, especially compared to what you see working in GDK. It mostly resulted in file corruption, because the worker processes had wrong information about which file to write to; they inherited that information from the master process. The final solution turned out to be checking whether the PID changed on every operation, because otherwise there's no way to execute code just after the fork before any other code runs: there are background threads already running, and they can already start corrupting the files. So that's handled now. Okay, next slide, please.

Right now we're in a phase where we're trying to optimize some of the operations. One of those was metric initialization. Once we added a lot of the InfluxDB metrics, it turned out they slowed the system down, at least for the initialization part. When the metrics are first created it takes a little longer, and that affects how long it takes to process the first request to a resource. That's because of locking: we hold a lock so that no other operation can touch the file while we add another entry, so a concurrent thread can't add the same entry in the same place and cause corruption again. To fix that, we re-implemented this part in C, which lets us take advantage of the global interpreter lock, and it's also faster simply because it's C. That's on another slide.

It turned out that after enabling all the metrics we have a lot of metrics to process, and the issue is exaggerated by the number of workers; I think we have 24 in production and 8 on dev. That multiplies the number of metrics we need to process on each scrape request, and Ruby is too slow for that. We use JSON internally to store the fields, small JSON documents of about four entries in an array, and parsing the JSON for the 200,000 metrics takes about five seconds for the JSON alone. After optimizing with a simpler JSON library, switching to C, and improving some of the algorithms, we went from 27 seconds to around 200 milliseconds. I already get the same results from the C code, and most of the parsing is in C right now. So sometimes the solution is not to use Ruby, sadly. This way we're fast enough that we can process even more metrics as needed, so we have room to grow and add more metrics on top of the 7,000 we currently have.

Okay, I think that's all from me, and that work is still in progress. Once we finish, it will be deployed to dev, the test server we now have, and we'll use it to test all the changes; hopefully that will stabilize things and fix all the performance problems. Yeah, I think that's it, so you can talk now. All right, thanks Paul.
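As a rough illustration of the "check whether the PID changed on every operation" guard Paul mentions: any helper that touches the metrics file first verifies that the current process is still the one that opened it, and reopens a fresh file if Unicorn has forked in between. The real fix lives in the gem's C extension and keys files by worker ID, so the class and method names below are hypothetical.

```ruby
# Hypothetical sketch of the "re-check the PID on every operation" idea.
# With preload_app, handles opened in the Unicorn master are inherited by
# the workers, so each worker has to notice the fork and reopen its own file.
class MultiprocessStore
  def initialize(dir)
    @dir = dir
    open_file_for_current_process
  end

  def increment(key, by = 1.0)
    reopen_if_forked             # the crucial guard, run on every operation
    @file.flock(File::LOCK_EX)
    @file.puts("#{key} #{by}")   # stand-in for the real mmap'ed update
  ensure
    @file.flock(File::LOCK_UN)
  end

  private

  def open_file_for_current_process
    @pid = Process.pid
    @file = File.open(File.join(@dir, "metrics_#{@pid}.db"), 'a+')
  end

  def reopen_if_forked
    return if @pid == Process.pid
    # A fork happened since this store was created: drop the inherited
    # handle and write to a fresh file instead of the parent's.
    begin
      @file.close
    rescue IOError
    end
    open_file_for_current_process
  end
end
```

The point is that the check runs lazily on every write, which works even when background threads start touching metrics before any explicit after-fork hook has a chance to run.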
Yeah, so there was a whole bunch of problems getting our metrics library up to a standard where we can actually use it. Hopefully one of these days we'll be able to upstream some of these changes into the official Prometheus Ruby client. The official Prometheus Ruby client does not actually support multi-process metrics, so we would really like to contribute this upstream, and hopefully it's not too complicated a patch for them.

Coming up in 10.3, we've got a few new features for CE. We're going to be deploying Prometheus to Kubernetes automatically: if you have a project on Kubernetes and you're using CD to deploy it, CD will also be able to deploy a Prometheus server to monitor your app. This is going to be a really nice feature for Kubernetes users. We're also going to work on a first iteration of Grafana dashboards for Omnibus users, or specifically Omnibus admins, so that they can check the health of their GitLab installs using Grafana. We're also going to include Sidekiq and Workhorse metrics and turn them on by default, so Omnibus users will be able to see the status of their Sidekiq and Workhorse.

Coming up in EE, we're adding custom metrics. If you have a project, you'll be able to set your own Prometheus queries and display them in your project. That will be nice if you have business-specific or other custom queries you'd like to add to your merge request workflows. We're also going to try to add integrated browser testing, so we'll run before-and-after site-speed measurements and display them in the merge request interface, if I remember correctly.

And then Prometheus on gitlab.com production: we've been working on deploying Prometheus 2.0, which is an amazing new release. It has a 10x reduction in CPU usage and a 100x reduction in I/O usage, and because of this we were able to find some bugs that were causing excess CPU and memory usage and reduce the overhead of some of the recording rules we had. We also started deploying Prometheus to monitor the runner pools. Previously we had no idea what resources the gitlab.com runners in DigitalOcean were using or what state they were in, so we have been working on deploying Prometheus to monitor those. For that one we needed to start with Prometheus 2.0 because of the high rate of churn of the worker droplets.

And that's it; let's look at the questions. Yes, the merge request issues: a lot of this work has been going on in the Prometheus mmap gem. There's a separate repo for that, which I can share afterwards. How do we make sure we don't introduce new performance problems when introducing metrics? That's a good question. Basically, any metric with a large amount of label cardinality can be a problem. One of the metrics that didn't work well when we were converting the InfluxDB metrics had four separate high-cardinality labels: the Rails controller, the action, the method, and the caller. On a pretty idle server that was something like 20-30,000 metrics, and since it was a histogram metric, every label combination was also multiplied by 10. So it was exploding into, as Paul said, hundreds of thousands of metrics when running in dev, and that was just way too much. When you're adding metrics to an app, think about how many label combinations you're going to generate and make sure it's not an unbounded label set.
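To make the cardinality point concrete, here is a back-of-the-envelope calculation; the individual counts are illustrative assumptions, chosen so the totals land roughly in the ranges mentioned above.

```ruby
# Back-of-the-envelope label cardinality; the counts are made up but land
# roughly in the ranges mentioned above.
controllers = 100   # distinct Rails controller labels
actions     = 7     # index, show, create, update, ...
methods     = 4     # GET, POST, PUT, DELETE
buckets     = 10    # a histogram multiplies every label combination by ~10

series_per_process = controllers * actions * methods * buckets
workers            = 8 # dev; production runs around 24

puts series_per_process            # => 28000  (the "20-30,000 metrics" range)
puts series_per_process * workers  # => 224000 (hundreds of thousands per scrape)
```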
This is documented in the Prometheus upstream docs under the best practices section. Any other questions? Going once, going twice? Thank you, everybody. Have a good afternoon.