All right, wrong button. All right, welcome to the April monitoring team functional update. My name is Ben Kochi, and I'm the team lead. We have a good team of people working on monitoring features. We mostly cover Prometheus monitoring at this time, but we're working on tracing and logging support within GitLab and gitlab.com. The problem we're having right now is that we have a very small team, and the work requires a large amount of knowledge, so we've had some setbacks in our output because of the lack of resources. We're very interested in getting referrals and other sources of backend engineering time to continue delivering our features. We have two open headcount for candidates, and we're trying to reduce the scope of some of our feature changes in order to get things done faster.

So, 10.7. In 10.7 we did some good stuff. We made some improvements to the dashboard so that we can see more information about the pods and the deployments for people's jobs.

In 10.8 we're starting to work on SLO alerting. Basically, the idea is we're going to add the ability to create Prometheus alerts within the GitLab user interface. This will be super fantastic, because creating alerts can be a little difficult: you have to start to learn the Prometheus query language. So we're going to provide a basic interface that will let you create new alerts with minimal need to understand the Prometheus query language. And then, by including the Prometheus Alertmanager, we'll be able to route those alerts to users by email as our first iteration, and hopefully we'll get to other integrations as time goes on.

We're also going to start working toward the idea that you'll be able to see your production logs from within GitLab. So if you have a Kubernetes cluster attached to your GitLab service, our first iteration is that we're going to start showing the pod names, and we're laying the foundation to bring the Kubernetes pod logs into the GitLab UI.

We've done a bunch of work to continue to stabilize the Prometheus Ruby client, and it's been deployed in production. We didn't quite get it deployed by default in 10.7, so we're moving that to 10.8.

Prometheus in production has been quite a lot of our work recently. We've been working on the upstream Prometheus node exporter, and it's had a lot of new features and a lot of great contributions from the community. I'd love to thank all of the upstream Prometheus contributors for all the great features that we've been adding. Some of the important things you'll see linked in the slide deck, the biggest ones being that we now have NFS server metrics, and we fixed the NFS client metrics so that all the GitLab NFS servers can be properly monitored. We've also started doing some cool stuff with SLO/SLA graphs for gitlab.com; I'll show that link after the presentation.

And that's pretty much it. We'll go to questions, and I can do a quick demo of the dashboard. So if we bring up monitor.gitlab, we'll take a look at the Rails dashboard. We have this new thing where we can see the error rate. If we go back to the basics of what makes good alerting in Prometheus, we think about the reliability of a site in terms of how many errors we are producing and how fast it is; this is called the RED method of alerting. So we can see the error rates very easily on here. We actually have a real goal of 99.9% availability, which would be no more than 0.1% errors.
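Just to make that goal concrete, here's a minimal sketch of what checking an error-rate SLO against the Prometheus HTTP API could look like. The server address, metric name, status labels, and query window are placeholders for illustration, not our actual configuration.

```python
# A minimal sketch of checking an error-rate SLO against Prometheus.
# The server URL, metric name, and labels are illustrative placeholders.
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder address
SLO_AVAILABILITY = 0.999  # 99.9% availability => no more than 0.1% errors

# Ratio of 5xx responses to all completed 2xx/5xx responses over 5 minutes.
ERROR_RATIO_QUERY = (
    'sum(rate(http_requests_total{status="500"}[5m]))'
    ' / '
    'sum(rate(http_requests_total{status=~"200|500"}[5m]))'
)

def error_ratio() -> float:
    """Run an instant query and return the error ratio as a float."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": ERROR_RATIO_QUERY},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    ratio = error_ratio()
    availability = 1.0 - ratio
    status = "OK" if availability >= SLO_AVAILABILITY else "SLO violated"
    print(f"error ratio: {ratio:.4%}, availability: {availability:.3%} -> {status}")
```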
But we're still working on improving the reliability of gitlab.com, so we're not quite going to reach that yet. We've set a 99% availability threshold line here, and as we improve the reliability of gitlab.com we'll be able to crank that bar down lower.

The same thing goes for latency. We can see that the average latency, the 50th percentile, and the 95th percentile latency are not great but not terrible. We have quite a lot of variation in web latency, and that's how fast the site is responding for every request.

So let me bring up the chat and see if I have any questions. Yes, more automation is great. A lot of this is going to make it much easier for developers to see their products within GitLab. Joshua says, yep, we're working on production-ready metrics. Yep, so more users are trying to get to the Prometheus server in GitLab, and we're slowly working on coming up with these kinds of reliability queries for users.

Kathy, so yeah, natural language processing could be interesting. Actually, if you look at these queries, they're pretty natural as they are. If we look at what the query is, the Prometheus query language overall is actually quite simple. We can see that we're asking for the Rails completed requests: we're taking the rate of completed requests that have a 500 status code and dividing that by the rate of completed requests that have a 200 or 500 status code. So it's not quite natural language processing, but it is a fairly intuitive language for querying data from the database.

As far as whether we're going to increase the reliability at a fixed rate, that's up to the production team and the development teams. I don't know; I don't have an answer for that. That's definitely a question for our production team.

Any other questions? Yes, but I'm not as quick to type, so I'll just ask it: you just include the 200s and 500s here. Is this the industry standard? Why don't we include the 400s and so on? I mean, we could pad our reliability artificially, right?

Yeah, that was part of the discussion when we were composing this dashboard: do we want to include the 300 redirects and 400 responses? And we said no, we didn't want to pad it. But it's pretty easy; I could just go here and replace this with all statuses, and you can see our actual error rate should drop, given how many requests we have. I think we can simply add all statuses together, and the ratio should drop when the graph refreshes. It dropped a little bit. Yeah, it dropped a little bit. Most of our redirects, I think, are happening at the HAProxy layer, so maybe it would be better just to include all of them. I was assuming the errors would be drowned out by redirects, but it's probably okay to include them. Yeah, and of course, that's one of the nice things about the Prometheus query language: it's quite easy to just experiment, change things, and play with this data.
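To show how small that experiment is, here's a rough sketch of the two query variants we just compared, built as strings in Python. The metric name and status labels are placeholders; GitLab's real metric names differ.

```python
# A rough sketch of the two error-rate variants discussed above.
# http_requests_total and the status label are illustrative placeholders.

def error_ratio_expr(denominator_selector: str) -> str:
    """Build a PromQL expression: rate of 5xx over rate of <denominator>."""
    return (
        'sum(rate(http_requests_total{status="500"}[5m])) / '
        f"sum(rate(http_requests_total{denominator_selector}[5m]))"
    )

# Variant 1: only completed 2xx/5xx requests in the denominator (the dashboard's current form).
print(error_ratio_expr('{status=~"200|500"}'))

# Variant 2: every status, including 3xx redirects and 4xx responses; the ratio
# drops because the denominator grows.
print(error_ratio_expr(""))
```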
Is it feasible to get monitoring info for clients so we could do some performance modeling? Are you talking about the client view? Ah, you want to get metrics from GitLab users. That's a good question. That's something I think we would want to go through our sales and developer relations teams for, and ask for access to their Prometheus metrics. We could totally ask different organizations who are using Prometheus publicly, like Debian, and see if we could get access to their Prometheus servers, or whether they could use one of the remote write streaming options to send us a copy of the data. I think it would be super amazing to collaborate with some of our customers to get access to their data.

We had one idea a while back that we wanted to provide a hosted monitoring service, basically an automated system that would allow customers to just click a button and have all of their metrics streamed into a GitLab service, and we could give our support access to that. I think there are some open issues for that. But like I said, we have only one backend engineer on the monitoring team, so these kinds of features would require significantly more staffing before we could offer monitoring as a service to any GitLab Enterprise user that wants it.

So, Ben, can I give a bit of the broader picture? Sure. What's happening right now is that we are getting metrics from GitLab.com, and there are some metrics in GitLab, but not a lot. We really like Grafana, and Josh already said we're probably going to ship that with GitLab. And right now it's really hard for people at our customer sites to access these metrics, because they're only in the admin panel. So what we're going to do, and that's the issue I linked, 1416, is expose the metrics by default to all users of an instance, not just to the admins. You won't have to be an admin anymore to see them. For example, then I could view the metrics that are coming off of GitLab.com by default: I'm not an admin, and I shouldn't have those credentials, but it's still interesting to view the metrics. So everyone in the company, in the organization, has access to them. This way they're also much easier to access for our support engineers. The metrics won't have to be streamed to us; as long as our support engineers are working with a client and can access the instance, with a login or maybe even without one, they can see them.

And then, instead of us building out the performance monitoring of GitLab.net, which is awesome but which only we profit from, and not all our customers, we should start dogfooding. We should start using the metrics that we ship with GitLab by default, because our customers also need to see everything to diagnose their own issues. There shouldn't be this big rift between the two. That's not going to be easy, because we're running at a hundred times the scale of our customers, but I think we can get a long way. And the worst case is that we ship some metrics that our customers have no use for, or that are not populated, which is not the end of the world. So that is the vision: have the metrics at a path on the server, have them accessible by all users of that server, and maybe even to anonymous people, and make sure that we're using them ourselves and that that's the best version, so that everyone's looking at the same page and everyone profits.
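To sketch what "metrics at a path on the server" could mean in practice: a client could fetch the instance's metrics endpoint and read the Prometheus text format directly. The host and the /-/metrics path here are assumptions for illustration, and whether the endpoint is readable without admin credentials depends on how the instance is configured.

```python
# A minimal sketch of reading metrics exposed by an instance over HTTP.
# The host and the /-/metrics path are assumptions; access rules depend on
# the instance's configuration.
import requests

INSTANCE_URL = "https://gitlab.example.com"  # placeholder instance address

def fetch_metrics(prefix: str = "") -> list[str]:
    """Fetch the Prometheus text exposition format and keep matching sample lines."""
    resp = requests.get(f"{INSTANCE_URL}/-/metrics", timeout=10)
    resp.raise_for_status()
    return [
        line
        for line in resp.text.splitlines()
        if line and not line.startswith("#") and line.startswith(prefix)
    ]

if __name__ == "__main__":
    # Print a handful of sample lines so a support engineer can eyeball them.
    for line in fetch_metrics()[:10]:
        print(line)
```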
Yeah, it would be great if we were able to completely dogfood all the dashboards that we use for GitLab.com for end users. I think status.gitlab.com is not a Prometheus-backed system; I believe that's actually coming from some other service. That's correct, it's not. And I think we've had this discussion about using Grafana and Prometheus to replace it.

But I think if we had a status page that was bundled with the product, that would be ideal, because then we would just use the same thing that our customers use for determining the health of GitLab. Of course, the fact that we are more distributed than most of our customers is a challenge here. But if we could just have all the metrics centralized in Prometheus and come up with a central status page, I think that would be amazing.

Gregory asks, are we going to re-implement all of Grafana? I really hope not. I'm trying to keep our dashboard and graph implementations as simple as possible, because re-implementing Grafana is a huge amount of work, and I don't think we want to get into that. One of the ideas was that we could just bundle Grafana with Omnibus, but that is yet another really big tool to ship to every customer. So possibly; this gets into one of those situations of how much extra stuff we have to bring in to support this.

All righty, there were some comments here about status pages, which are something a lot of people run for their customers. The needs of the customers who use your product are a bit different from the needs of your operations team. So if we had something similar to a status page with Prometheus as the backend, it would be really cool to have that for a holistic view of the health of GitLab. Are there any questions?
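While we wait, here's a rough sketch of the kind of Prometheus-backed check a status page like that could be built on. The Prometheus address, the metric names in the queries, and the thresholds are all placeholder assumptions, not a design we've settled on.

```python
# A rough sketch of a Prometheus-backed status check for a status page.
# The Prometheus address, metric names, and thresholds are placeholders.
import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # placeholder address

# Each check maps a label to (PromQL expression, threshold the value must stay under).
CHECKS = {
    "error ratio": (
        'sum(rate(http_requests_total{status="500"}[5m]))'
        ' / sum(rate(http_requests_total[5m]))',
        0.01,   # less than 1% errors
    ),
    "p95 latency (s)": (
        "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))",
        2.0,    # 95th percentile under two seconds
    ),
}

def instant_query(expr: str) -> float:
    """Run a Prometheus instant query and return the first value, or 0.0."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    healthy = True
    for name, (expr, threshold) in CHECKS.items():
        value = instant_query(expr)
        ok = value < threshold
        healthy = healthy and ok
        print(f"{name}: {value:.3f} ({'ok' if ok else 'over threshold'})")
    print("status:", "operational" if healthy else "degraded")
```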