Welcome everyone to the Prometheus Maintainer Session; we still have some people coming. Sorry, we'll just start talking. Okay, we are starting. So we are newly initiated Prometheus team members. I'm Kemal. I'm mostly working on client_golang stuff. I'm Brian. In my day job I'm at Grafana Labs. I work on Prometheus, especially on performance. Yeah, let's see a show of hands. Who here actually runs Prometheus? That is amazing. Okay, everyone. That is, yeah, for the benefit of the tape, that is basically everyone. That's nice. Do we have any newcomers? Yeah, yeah, brave people put their hands up. Yeah. I mean, it's good. Welcome. Welcome to the community. Come to learn. Definitely. We will have some introductory slides so you can learn what Prometheus is. Let's start. Yeah, so we figured there would be some people who were totally new, so we put this stuff up. What is Prometheus? We describe it as a metrics-based monitoring and alerting stack. And the key tagline: it's made for dynamic cloud environments. So it's made for environments where things come and go a lot. The project started at SoundCloud in 2012, was open sourced, and joined the CNCF as the second project after Kubernetes. So Prometheus is actually older than Kubernetes, but the second project into the CNCF. v1 came in 2016, v2 in 2017, and it graduated, which is the highest level of CNCF project, in 2018. And if you want to see a lot more about the history, there is a half-hour documentary on YouTube, which is really well made. It's really nicely filmed. It makes it look cooler than it is. Yeah. And if you Google "Prometheus documentary" and you find yourself looking at space aliens, that's a different film. Okay. These numbers come from Grafana Labs, who announce them at PromCon every year. So these are not all Prometheus servers; they're just the ones that people are running Grafana to look at. But the last number published was 774,000 Prometheus servers in the world. So you're in big company.
In terms of the community, we have had over 11,000 commits to Prometheus from 783 contributors, and nearly 50,000 stars. So go star Prometheus. If you haven't done it before, go to GitHub, click that star button. Let's see if we can get to 50K before the end of KubeCon. And people keep joining the project; there are, I think, 26 people on the Prometheus team. And every time someone joins, we get one of these emails coming out. So I'd like all of you to think about joining the team. We'll have some words at the end about how you can get involved. Okay. I'm going to hand over to Kemal now to talk about the architecture.

Yeah. For the newcomers: Prometheus is a pull-based metrics system. There are ways to actually push metrics out of your workloads, but we are not going to touch that. This is the vanilla way you actually expose metrics. If you own the application or the workload, you can use client libraries to expose metrics. We also have collectors to make your life easier for the well-known metrics. If you don't have access to the systems that you run, we also have exporters, so that you can expose metrics out of your running system as well. And then we have a single binary that packs a couple of major functionalities in one place. That binary actually scrapes all these endpoints. There is a TSDB in which we store all the collected metrics. And we have a rule engine and an alerting system in place. From that, we already showed that there can be a lot of targets. We call the endpoints we want to scrape "targets". But it's really hard to actually maintain a static list of those endpoints to reach out to. So we have service discovery. If you are using Kubernetes or whatnot, you can just use it out of the box. It will discover everything in your workloads and just scrape all the metrics. Then you can query it using the Prometheus web UI, or Grafana, or any automation that you have.
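To make the client-library side concrete, here is a toy sketch — not client_golang's real API, and `http_requests_total` is just an illustrative metric name — of the text format a client library exposes on `/metrics` for Prometheus to scrape:

```go
package main

import "fmt"

// renderCounter formats one counter family in the Prometheus text
// exposition format — the kind of output a client library serves on
// the /metrics endpoint that Prometheus scrapes.
func renderCounter(name, help, labels string, value float64) string {
	return fmt.Sprintf(
		"# HELP %s %s\n# TYPE %s counter\n%s{%s} %g\n",
		name, help, name, name, labels, value)
}

func main() {
	// In a real service you would register this on an HTTP handler at
	// /metrics; here we just print what a scrape would return.
	fmt.Print(renderCounter("http_requests_total",
		"Total HTTP requests handled.", `method="get"`, 1027))
}
```

In practice you would use a client library (client_golang, client_python, and so on) rather than formatting this by hand; the sketch only shows what travels over the wire.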
We have a query language for that, and you can do pretty advanced stuff with it. And as I told you, we also have a rule engine that you can actually alert on. And the whole idea behind this architecture is that we really want to make Prometheus reliable. That's why it's a single binary: you can just drop it into your environment and it will just work and collect everything. What else can I say about the architecture? Well, let's say if you have questions — we want to leave a lot of time. This is a 35-minute session and, well, we started early, so there's a lot of time to fill with questions; get thinking about that. We've got a few more slides, but you already have a question. Okay, go for it. We need to pass the mic.

What is the relationship between Prometheus and APMs in general? Like New Relic or that kind of stuff?

Do you want to take this one? I don't know, I've never used New Relic. So I think APMs are doing some high-level things, and nowadays they think Prometheus is ubiquitous enough that they support Prometheus workloads as well, and they scrape that. I think that's one part of it. But in general APMs do a lot of things around your running workloads: they come with some sort of agent and then they try to collect some generic metrics from outside your workloads, right? So it's kind of black-box monitoring, as we say in the terminology. With Prometheus, it's more of a white-box monitoring approach. We need to actually instrument things — or we also have exporters, which are something in between: the knowledge of how to instrument certain applications, the metric business logic, is packed into the exporters, and they can export that. So those are more or less the differences, but we also have other Prometheus maintainers here. If they want to comment, feel free. I mean, I can comment, and I will be happy with questions, but my opinion is you should introduce yourself first. I'm Bartek, I'm maintaining Prometheus as well.
I actually co-founded Thanos. But my answer to "what's the difference between APM and Prometheus?" — to me there is none. I think you can use exactly the same momentum and tools to debug your application as a developer, and that to me was "application performance metrics", right? So I feel the same use cases can be served by Prometheus, yeah? Thanks, Bartek. No worries, I'll be there for questions, but go on.

Let's leave questions till the end of the slides but, yeah, thanks for that. If anyone's really confused, do jump in. But where did we get to? So we went quickly through the architecture. Yes, let's see what's coming. We're going to talk about what has been released in the last six months. Prometheus has a lot of associated projects, like the node exporter that gets data out of Linux computers, the Windows exporter that gets data out of Windows computers, and so on and so on. So we couldn't fit everything from the last six months; we just looked at Prometheus itself. The big thing is what are called native histograms. This is a way of getting much, much better resolution out of your histograms. These are heat maps: the one on the left is what you get out of previous Prometheus versions — most people are getting that kind of resolution — and the one on the right is the so-called native histograms, which are far higher resolution. And if you want to find out a lot more about that, we have links here to three different talks given by Björn Rabenstein. Is Björn in the room? No, he's not. Probably couldn't get in. So yeah, Björn is the professor of histograms and has talked extensively about that. Just one thought that he did give us: if you have this ability to store high resolution, then you need to be careful that your heat map one day doesn't look like this. Don't click on it. Something I personally am working on is reducing the memory of Prometheus.
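Björn's talks cover the details of native histograms; as a rough sketch, assuming the standard exponential bucketing scheme those talks describe (bucket boundaries are powers of `base = 2^(2^-schema)`, so increasing the schema by one doubles the resolution), the bucket math looks like this. The function names here are illustrative, not Prometheus API:

```go
package main

import (
	"fmt"
	"math"
)

// bucketIndex returns the exponential bucket an observation falls into:
// the index i such that v lies in (base^(i-1), base^i], where
// base = 2^(2^-schema). Higher schema means narrower buckets.
func bucketIndex(v float64, schema int) int {
	return int(math.Ceil(math.Log2(v) * math.Exp2(float64(schema))))
}

// upperBound returns the upper boundary base^i of bucket i.
func upperBound(i, schema int) float64 {
	return math.Exp2(float64(i) / math.Exp2(float64(schema)))
}

func main() {
	// With schema 3 (base = 2^(1/8) ≈ 1.09) an observation of 10 lands
	// in bucket 27, whose upper bound is 2^(27/8) ≈ 10.37.
	i := bucketIndex(10, 3)
	fmt.Printf("bucket %d, upper bound %.4f\n", i, upperBound(i, 3))
}
```

Because the boundaries are fixed powers of a common base, buckets from different targets always line up and can be merged or downscaled without re-bucketing — which is where the resolution win in the heat maps comes from.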
So the picture on the right was a fix in 2.39, which was for Prometheus instances in the hundreds of megabytes — hundreds of gigabytes, excuse me. What am I, in the 90s? Hundreds of gigabytes. It could get into this state where compactions were taking a tremendous amount of memory, and so we fixed that. And then there's 2.44, which is due out this week, but I'm kind of busy — I'm supposed to be releasing it, sorry, but I'm standing here, so that's why it's not released yet. Yeah, so there's a thing we call the string labels change, which should take you down about 20% in memory for most people. And you can try that out: if you look in the release notes for 2.43, there's a special image you can try it with.

Are you doing this bit? Yeah, I can do that. We had a dev summit on Monday and we already discussed a couple of things that are coming. One of the major things, as we already mentioned: there's another component called Alertmanager, and the front end for Alertmanager used to be written in Elm, a very esoteric language. Now we are in the process of rewriting it in React, so that people contribute more and we can also be in sync with the other Prometheus ecosystem tooling. We also have some metadata improvements coming. These are partly related to other standards like OpenMetrics and whatnot, but they are coming. We also decided on improving exemplar support: we're going to persist them, that's one thing, and you can also use them for alerting as well. And there's an ongoing initiative to rewrite the remote write API. It's not there yet, but there are some improvements in that as well. I think the major thing is about transactional writes. So, work in progress — a lot of work in progress. As we already talked about, we really want to expand the community. We want to get more contributors, maybe more maintainers, and that's why we actually need you.
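As an aside on the string labels change mentioned above: the idea, roughly, is to pack all of a series' labels into one Go string instead of a slice of name/value structs, each of which costs its own heap allocation and pointer overhead. This is a toy sketch of that idea, not the actual encoding Prometheus uses:

```go
package main

import (
	"fmt"
	"strconv"
)

// pack joins alternating label names and values into a single string,
// length-prefixing each element (3 digits, so a toy limit of 999 bytes
// per element) so the pieces can be recovered later. One string means
// one allocation instead of one per label name and value.
func pack(pairs ...string) string {
	s := ""
	for _, p := range pairs {
		s += fmt.Sprintf("%03d%s", len(p), p)
	}
	return s
}

// unpack reverses pack, splitting the packed string back into elements.
func unpack(s string) []string {
	var out []string
	for len(s) > 0 {
		n, _ := strconv.Atoi(s[:3])
		out = append(out, s[3:3+n])
		s = s[3+n:]
	}
	return out
}

func main() {
	packed := pack("job", "api", "instance", "10.0.0.1:9090")
	fmt.Println(unpack(packed)) // round-trips back to the original pairs
}
```

With millions of series in memory, trading many small string headers for one packed string per series is where a saving on the order of the 20% mentioned above can come from.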
And for that we also have a new thing at CNCF KubeCon, this ContribFest. Those are workshops: you can actually come with your laptops and we will help you contribute to Prometheus. There are several of those for different projects. This session will be led by, I think, Bartek and Goutham on Friday. So take a picture. Come along. I mean, not just run it, also give back. That would be nice. As an example, we put this here: you don't need to write code, you don't need to write documentation. You can actually help us run stuff and test stuff. That's also a contribution, and we really appreciate it. For example, we put this PR up there. It's been sitting there for a couple of months, if I'm not mistaken. Somebody needs to test it on Azure. We don't have access to an Azure environment to run it ourselves. But if you are using Azure, just give it a try. This is about service discovery, if I'm not mistaken. If you are running it, give this patch a try. If it's working, just comment on that — we'll trust you and merge it — or show us what's wrong with it so we can fix things. With that, we are open for questions.

Okay. Who's got the microphone? Bartek's got the microphone. Who wants to ask a question?

Can you say something about the relationship between OpenMetrics and Prometheus? Yeah. I have a second question. No, just one at a time. I hate it when people... sorry, not you personally. We've got hundreds of people here. OpenMetrics is a separate CNCF project that was set up with the aim of writing a formal standard for a metrics format — how you can write down metrics in a file or on the wire. So those were the aims: A, to be a formal standard, and B, to be based on Prometheus. So that was set up. It happened. There is an OpenMetrics standard, and it's published. And it turned out to be very difficult to get all the edge cases defined. So it took a very long time.
But that is the relationship between the OpenMetrics project and the Prometheus project. Since that happened, I think the broader community has shown much more interest in OpenTelemetry, which is another CNCF project. So there probably won't be further iterations of OpenMetrics. I can't speak for that project — I'm not on that project, it's a separate project — but that's my guess: the focus will be more on OpenTelemetry going forward.

There are also two facets to this. Prometheus actually scrapes OpenMetrics endpoints, which you can use right now. I don't remember if we are fully supporting all the use cases — it's still experimental — but we support use cases like exemplars, right? That's also part of OpenMetrics. As a client, there are lots of client libraries; that's the second facet. You need to expose metrics in the OpenMetrics format, and there are some nuances to that. We only support that in client_python, if I'm not mistaken; even client_golang hasn't supported it yet. And as Brian mentioned, maybe there could be some second iteration of OpenMetrics, and that would make it easier for us to integrate. So right now we are waiting to see what happens with OpenTelemetry metrics. We'll see.

I think I can add that there are two different governance structures, two different teams, and we are working to collaborate better and actually get more iterations of OpenMetrics, potentially. And how I understand it is that OpenMetrics is really good as an almost primitive exposition format — a good protocol for scraping — and then OpenTelemetry, the metrics part, is amazing for pushing. These are what we see as the emerging use cases, and probably — at least from the client_golang perspective, as I maintain that — we will have both, maybe, in the future. But I would really love to see OpenMetrics evolve as well. We are working on kind of merging those two teams and helping to grow both the implementations and the actual format specification.
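For reference, this is roughly what an OpenMetrics exposition looks like — an abridged, illustrative sample (the published specification is authoritative; the metric name, label values and trace ID here are made up):

```text
# TYPE http_requests counter
# HELP http_requests Total HTTP requests handled.
http_requests_total{code="200"} 1027
http_requests_total{code="500"} 3 # {trace_id="abc123"} 1.0
# EOF
```

The format is deliberately close to the Prometheus text format, with some of the stricter additions the answer above alludes to: the `_total` suffix convention for counters, the optional `# {...}` exemplar attached to a sample, and the mandatory `# EOF` terminator.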
All right, let's have another question. Thank you very much for that question. Another question.

Hi. So I use Grafana for alerting, and we deploy the kube-prometheus stack, and then Prometheus ships with an Alertmanager, and it also has some alerts in it. And if you use Prometheus as a data source it will map those alerts into Grafana, but I'm not able to actually alert, or at least couple an action to those alerts. Is that something on the backlog, or is that a bug?

Okay. So I should have mentioned at the beginning that I'm standing here in place of Josh, who originally stepped up to co-host this talk with Kemal. Josh couldn't make it, so I'm standing in his place, but Josh knows way more about Grafana alerting than I do. Do you want to take it? Oh, you know the answer.

I can speak to that. I'm Simon, I'm also a Prometheus maintainer — mostly an Alertmanager maintainer, with Josh — and also quite familiar with kube-prometheus in addition to that. The idea of kube-prometheus is really using Prometheus for collecting metrics and Alertmanager because that is the thing that will dispatch alerts. Again, the alerting rules are configured in Prometheus; they are not configured in Alertmanager. Alertmanager is just there to group the alerts and dispatch the notifications. So this is why in kube-prometheus itself we are mostly focusing on what I would call the vanilla stack: using Prometheus for alerting, using Alertmanager for dispatching notifications, and so on. I can't speak for the kube-prometheus maintainers, but I guess that if you wanted to integrate with Grafana alerting as well, it could be possible to have that as an add-on. But yeah, like Brian said, the idea is that Grafana alerting would rely more and more on the Alertmanager code base in general — I don't want to speak for Grafana again. Did that answer your question? Great, thank you. We've got one over here. Thank you, Simon.
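A minimal sketch of the split Simon describes — rules evaluated by Prometheus's rule engine, grouping and routing done by Alertmanager. These would be two separate configuration files in practice, and all names (alert name, metric, receiver) are illustrative:

```yaml
# Prometheus side (a rule file): the rule engine evaluates the PromQL
# expression and fires the alert to Alertmanager.
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{code="500"}[5m]) > 0.1
        for: 10m
        labels:
          severity: page

# Alertmanager side (alertmanager.yml): groups the incoming alerts and
# dispatches notifications to a receiver.
route:
  receiver: oncall
  group_by: [alertname]
receivers:
  - name: oncall
```

Note that the `expr` line is plain PromQL — the same query language mentioned earlier for dashboards and ad hoc queries — which is why the rules naturally live on the Prometheus side.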
Hi, my question is about OpenMetrics going in the direction of OpenTelemetry, which has become the de facto tool to collect logs, metrics and traces. Are the use cases in Prometheus going to cover more integration with OpenTelemetry? Right now it's a little bit more complicated than, of course, using the normal collectors that we have from the project. So it would be good to know if it's going to move more in that direction. I missed which direction? To have the use cases for collecting or integrating with OpenTelemetry — is Prometheus going to interoperate better with OpenTelemetry?

Just to clarify, we didn't claim that OpenMetrics is going in the OpenTelemetry direction. We don't know anything about it; we are not the maintainers of OpenMetrics. We don't have RichiH in the room — he might give you more information. From the Prometheus side, we don't want to be kingmakers or anything between those formats. We are planning to support them, and there are some initiatives. At the last dev summit — not the one on Monday, but the previous one, at PromCon — we actually talked about that, and there is an OTLP initiative in Prometheus: whether client libraries are going to support it, maybe expose it, or whether Prometheus is going to ingest it. This has been discussed and we will experiment with it, but as far as I remember there isn't any implementation yet, right? Maybe we could ask Goutham, but he's also not in the room. There are experiments, but it's worth mentioning that we are working together with OpenTelemetry so that the transition between data formats works. From our perspective, let's be honest, we have native histograms, which are much better than whatever we have had in the community so far, and OpenTelemetry doesn't have that. So there are lots of innovations in each direction, and we want to also be very fast and very efficient. Hang on — that's a case where OpenTelemetry has what they call exponential histograms, but it's not the same.
But the definition was built in parallel with the implementation in Prometheus. I don't want to take you up on your Latin, but I think Prometheus is the de facto standard and OpenTelemetry is the de jure standard. Anyway, we'd be curious to hear what exactly you are missing, for example, or what use case you are missing, and we can take that as a team and check if we can do anything else here. Okay. Just saying — even on that we cannot really comment. There was also a really nice talk by Ganesh about OpenTelemetry histograms versus Prometheus. Oh yeah, that talk — the recording — that was Observability Day, which was yesterday. Seems like a long time ago. The video for that will be out, if you didn't make it, in a couple of weeks, I guess.

Okay, who's got more questions? Yeah, many questions. Okay, try to remember. Hello. I have a question about the scalability of the Prometheus instance itself. For now, from what I've experienced, we can only scale it vertically. Is there any roadmap feature that will change how Prometheus scales? Yeah, your question is that Prometheus scales vertically, which is true, as Kemal was saying. Actually, because of that we moved to a VictoriaMetrics cluster — is there any feature on the future roadmap to be able to scale Prometheus horizontally?

Well, I'm sure Bartek and Kemal are too modest to mention it, but there's another project called Thanos that they maintain. So, Prometheus is meant to be a single process, so it's very easy to run, and that's not going to change. So that does fundamentally limit things — well, that and the amount of RAM you can afford fundamentally limit it, though I did fix some of those things with the RAM. So I can definitely mention a project called Thanos, which is taking the code of the modules of Prometheus, adding a lot more modules, and turning that into a distributed system. Kemal, do you know of any other projects that do that?
Yeah, a couple of them, like Cortex and Mimir. If you want to have something more centralized, you can always use Mimir, right? Yeah, thank you — it turns out that I'm a maintainer of Mimir. And you were too modest to mention it, yes. Yeah, so generally, yes, that's a non-goal, right? However, we are working with those projects — really any kind of distributed Prometheus project. We are sharing the same code base, so in the end we are very connected, we understand each other's issues, and Prometheus is adapting, maybe with APIs, to help those distributed storages — Thanos, Mimir, Cortex and, you know, others, whoever wants to join — scale better, right? So this is our way to the optionality of: hey, you want a distributed thing, go to those projects. And this is by design. We will never do that ourselves — it's actually amazing that we have a very nicely scoped-down project where we can collaborate on amazing scraping, an amazing simple single-instance database, and amazing agent and collection capabilities, and really focus on that. So this is really a good thing, in my opinion.

Any other questions? I think you asked already, sorry. When everyone else runs out, you can have your second one. Hi, super quick question: is the experimental Prometheus agent mode stable now? A few years ago it was introduced so that you can use Prometheus just as a proxy for remote write and use backends like Thanos, Mimir, VictoriaMetrics, whatever. So is it stable now or not? The definition of "stable" is broader than "experimental" — I think we have official levels for that, so we know what the definition is. However, I don't remember if we moved it from experimental to production; we will double-check for you. I think it's a good moment to actually switch, because I think it's running in production everywhere by now. So I would definitely ask if we can move it to production, if it's not already.
Yeah, I suspect it's currently marked as experimental just because we forgot.

Sorry, maybe Kemal kind of talked about this, but I missed it. If an application wants to use OpenTelemetry and OTLP to push metrics, which component in your architecture is actually there to receive it? Or do you actually have a component that's compatible with OpenTelemetry, OTLP and all that stuff? Not yet, but it will be Prometheus itself that supports ingestion. Yeah, we have one — it's called the OpenTelemetry Collector. Yeah, well, it's not our component. As of today there's a part of the OpenTelemetry project called the OpenTelemetry Collector that fills that niche. Yes, exactly. However, we are working on having native ingestion of OpenTelemetry metrics, of course, and again, we will definitely try to experiment with client_golang, for example, exposing an OpenTelemetry push as well, if that would be technically efficient and so on. Thank you.

Yes, just a quick question to follow up on the question about high availability and horizontal scaling. I know that Mimir right now has some kind of deduplication and leader election of Prometheus instances, and I assume Thanos and the other projects do something similar. Are there any plans to move that logic into Prometheus itself? For example, if you have an idea of which Prometheus instance is the main one, then with remote write, at the time of writing, you could decide whether your data is actually going to be lost when you write it further on into the bigger cluster.
So, first of all, let's set some background, as people said they were new to the project. Prometheus is, as we said a moment ago, one process, and it's designed that way to be very simple to run. That leaves people with a problem: if that one process crashed, or the machine went down or something like that, then they have a gap in their metrics. So sometimes people run two, and they call this HA — I like to put the scare quotes around it, but they call it HA, that's the fact. So, the two different approaches: Mimir and Cortex deduplicate the data at query time, and at compaction time, I think. To your point, there's a PR from Oleg — I've forgotten his GitHub handle — but there is an optimization which basically stops the secondary from sending the data if it's going to be thrown away. Is that what you're asking about, when you're remote writing to Prometheus? Oh right, is Prometheus going to implement HA? Oh, that's a good idea, I've never thought of that. File an issue. Yeah, definitely, PRs are welcome. Exactly — show us the use case. I would love to have the remote write go somewhere; I'm not really thinking of Prometheus, but yeah. Any other questions? Come on, there's one over there. Let me check — I put it on Slack and Twitter as well; let me see if I got any.

What's your recommended way of doing long-term storage? I recommend Thanos. It depends what you want to do, right? The great thing about Thanos is it's very easy to install as an add-on: if you have an existing Prometheus it can run as a sidecar, and it can do long-term storage of your data on a blob store like AWS S3. So it's pretty easy to understand what's going on, it's pretty easy to get started, and it scales up — that bit's great. Let me see, do you want to say what's great about Mimir? Mimir is always more centralized, right? So if you need something like that, to collect everything in a single place with multi-tenancy attached, you can always use that. It's a Grafana project and it plays nicely with Grafana as
well. But Thanos has some other functionality with the sidecars and whatnot, if you need that. For the rest, they are more or less the same projects: they're solving the same problem with different trade-offs. In the end it depends on your preferences or use cases. All right, thanks. Next question.

Hello. Imagine that you have two Prometheus instances for HA purposes and also two long-term-storage TSDBs. I'm looking for a tool so that when I query these two long-term-storage TSDBs, it — as you mentioned — deduplicates the data, because both Prometheus instances are writing to both long-term-storage TSDBs, and when querying from Grafana I query the long-term-storage TSDBs. So Thanos does that, right? Yes, this is the perfect use case for the Thanos sidecar. Because we're using VictoriaMetrics as the TSDB, I found some other tool on GitHub, Promxy or something, which kind of does it. Yeah, Promxy is not obscure; it's actually an amazing tool, from one of our contributors as well, so you can try that for sure — it will solve it. What was the name of the project? Promxy, P-R-O-M-X-Y. Yeah, amazing — an amazing contributor is doing this. The Thanos Querier will do something similar for you; you may need to add a sidecar and additional stuff, but yeah, that's a way. Okay, so we're all learning things, thank you. Okay, another question. Maybe just that you don't need to store that data: you can just have sidecars and Thanos Queriers, and you can have the deduplication and that global view of your Prometheus. So there's no... well, he's using a third-party, non-CNCF project, so I'm not going to talk a lot about it.

Cool. So I have a question around this new project called Perses from the CoreDash community. Do you have any hot takes, or any future plans for Prometheus to integrate it or do something with it? To be honest, I don't have any idea. Does anyone in the room know about Perses?
I can't speak for Augustin, but Augustin is looking into the Prometheus UI — he's one of the maintainers, mostly with Julius — and he's also the developer of Perses. So again, I can't speak for him, but the idea is really that everything would converge at some point, at least using the same libraries, the same code base — and even with Thanos too. So, visualization is another non-goal of the Prometheus project. Prometheus comes with a UI which for a long time was just on the border of unusable. It's got a lot better in the last year or two, with syntax completion and stuff like that, but broadly speaking, visualization is not a goal of the Prometheus project; there are other tools that do that. Yeah, I think Perses is essentially a dashboarding solution that is coming, I guess, to compete with Grafana a little bit. So if you want to contribute to any of those, you're welcome. Last two minutes — one over there. Are we...? You're right, we're pretty close.

Yeah, so basically we are scraping metrics in our cluster with Prometheus, of course, and we've hit some limits: there are too many metrics to collect. And we found on the web that there is Thanos, which you've mentioned many times, but there is also something like Prometheus federation. I'm not familiar with that, basically, but we are wondering within the team: do we go the Thanos way, go the Prometheus federation way, or maybe combine them both, so that with this federation one Prometheus will get some metrics and a second Prometheus will get different metrics? Do you have some opinion on what we should consider — whether we can go the combined way, or maybe the Thanos way is the better one?

As we already talked about, you can actually shard your targets and have different Prometheuses. About federation: federation is a feature of Prometheus that predates all of the Thanos and so on, and the idea was, if you had, say, 10 data centres, you could have 10
different Prometheus servers, and then one would pull a subset of the data — because you couldn't fit all of it in one Prometheus, but you could pull a summary of the data, or a subset. That's the idea of federation as a Prometheus feature. So if that suits you — you need to design the subset; you need to figure out in your head: am I thinning out the data, am I summarising it in time, am I summarising it by throwing away some of the detail? How am I summarising it to get it small enough to fit in one Prometheus? If you can do that, it's still a feature of Prometheus. But in the meantime some of us built all these big distributed systems that you can just throw everything into and run a billion metrics, if you can afford it. With the sharding approach I was mentioning, you can shard your targets and put in some Thanos sidecars, maybe, and have a global view — that's also a way — or you can shard the centralised Prometheus; there are some trade-offs attached to that. There are lots of solutions, actually, and we don't have a single recommended one. The thing is, it's quite hard to figure out federation — essentially, to figure out your summarisation method. You have to figure out what suits your business, your situation; whereas the sharding thing is more automatic. But thank you, we need to stop. Thank you very much.