Welcome to this talk about Prometheus, where we will see what it is, what is new, and what is coming. I am Julien Pivotto, I work at Inuits, and I joined the Prometheus team in 2020. And I am Richard, I work at Grafana Labs, and I joined the team, I think, in 2016.

So let's talk about what Prometheus is; this is the quick 101. It is inspired by Google's Borgmon. It is a time series database, internally working with 64-bit values for all numerical data. There are over a thousand community-created instrumentations and exporters, and way more non-public ones. It is for metrics, not for logs; for dashboarding, the standby is Grafana, even before I joined Grafana Labs. It is highly dynamic, with built-in service discovery, which means that you have a lot of different ways to get data, or information about which systems to monitor, into Prometheus quite easily. There is built-in support for Kubernetes. You can do zone transfers through DNS. There is file-based service discovery, where you can feed in your own information from your own systems. And there are dozens of other ways to get the information about what to monitor into Prometheus without any additional work on your side; it literally takes it from your operators, from your orchestration, from everything you might have.

It does not have a hierarchical data model; it has an n-dimensional label set. What that means is: in a hierarchical model you might have continent, country, and customer, and then you need to select your data by customer across all continents, and all of a sudden your hierarchical data model is already wrong for that type of query. By just attaching labels, and allowing you to slice and dice your n-dimensional matrix however you want, you can simply select by whatever you want. For this you use PromQL, which is a functional language, and which is basically doing vector math.
What that means is you set up your language, your calculations, once, and then you just toss whatever amount of data into this calculation and get the result set out. No matter how much data comes or goes away, you always get out what you want. That is similar to how weather predictions and such are made. This language is used for everything within Prometheus: for processing, graphing, alerting, exporting. Everywhere you work with the data, you always go through PromQL. Super nice.

It is really simple to operate. You don't need teams upon teams or anything. It is a monolith, even though it is cloud native, and it is highly efficient; we will see a few numbers later. It is a pull-based system, which gives you super nice properties around ascertaining that everything which should be there, or is part of your service discovery, is also monitored. You always know what the latest status is, and if something goes away or goes down, you see this immediately, which gives you very nice alerting properties.

Very important concepts: black box monitoring, where you look at things from the outside (for example, do you reply to HTTP requests?), versus white box monitoring, where you look at the inside of your code and instrument from within. Usually in Prometheus land, every service should have its own metrics endpoint, which is unfamiliar for people who are used to agents. There are various advantages to having distinct metrics endpoints: you are not tied to agent versions, and you don't have this huge lift when you have an upgrade or anything. That being said, we will be talking about agents in a bit. And we have super hard API commitments; we have even historically treated pretty much everything which was experimental as stable.

What are time series? Time series are recorded values which change over time: temperature, your memory usage, how many requests you got.
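To make the label model from earlier concrete, here is a minimal Python sketch of selecting from an n-dimensional label set instead of walking a continent/country/customer hierarchy. This is not Prometheus code; the series, label names, and values are invented for illustration.

```python
# Invented sample data: each series is identified by a flat label set,
# mirroring Prometheus's n-dimensional data model.
series = [
    ({"continent": "eu", "country": "de", "customer": "acme"}, 120.0),
    ({"continent": "na", "country": "us", "customer": "acme"}, 80.0),
    ({"continent": "na", "country": "us", "customer": "globex"}, 55.0),
]

def select(series, **matchers):
    """Return every series whose labels match, regardless of any hierarchy."""
    return [(labels, value) for labels, value in series
            if all(labels.get(k) == v for k, v in matchers.items())]

# Slice by customer alone, across all continents and countries, the way a
# PromQL selector such as {customer="acme"} would:
acme = select(series, customer="acme")
print(sum(value for _, value in acme))  # 200.0
```

In a hierarchical model, the same question would force you to traverse every continent and country branch; with labels it is a single selection.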
What you can do with individual events is merge them into counters or histograms, which is also done extremely often in the networking scene, which had a lot of these, what are now called cloud native, scaling issues two decades ago. You get a super cheap way of compressing data along the axis you care about, and then you emit this distilled information about whatever you are looking at. It is highly efficient for both storage and transmission. You have probably read the examples by now; those are examples of how a metrics exposition might look. I even know people who literally printf in their C code, dump that file on a web endpoint, and that is their Prometheus integration, and it works. We have all the tooling to make this nicer, but if you so choose, you can literally printf, or even just echo in your shell script. It is really easy.

Talking about scaling: Kubernetes is roughly equivalent to Google's Borg. Prometheus is roughly equivalent to Google's Borgmon, but with the Monarch APIs. And Google could not have run Borg without Borgmon, which means also not their services for the last X amount of time. One of the indirect effects of this is that while Kubernetes and Prometheus were created completely distinctly from each other, they have inherently been designed for each other, and they are also written with each other in mind. We have people who are on SIG Instrumentation at Kubernetes and are also on the Prometheus team, which already tells you how much coordination goes on between the two projects. And the one thing which is actually recommended by Kubernetes to monitor Kubernetes itself is Prometheus, and they put insane amounts of work into kube-state-metrics to make it highly efficient to get all this data out of Kubernetes.
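The "literally printf" integration mentioned above really is all it takes; the Prometheus text exposition format is just lines of `name{labels} value`. Here is a sketch in Python rather than C, with an invented metric name, in the same spirit as the anecdote:

```python
# A sketch of hand-writing the Prometheus text exposition format, in the
# spirit of the "printf in C" anecdote. Metric name and value are invented.
requests_served = 1027

def render_metrics():
    lines = [
        "# HELP myapp_requests_total Total requests served.",
        "# TYPE myapp_requests_total counter",
        f'myapp_requests_total{{path="/"}} {requests_served}',
    ]
    return "\n".join(lines) + "\n"

# Serve this string on a /metrics endpoint (e.g. with http.server) and
# Prometheus can scrape it; no client library is strictly required.
print(render_metrics())
```

The client libraries add niceties such as registries, label validation, and histogram buckets, but the wire format stays this simple.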
On the level of a single monolithic Prometheus (there are long-term storages and such, but if you run a single Prometheus): you can get more than 2.5 million samples per second per instance. We have seen more than 60,000 samples per second per core. We actually went quite a bit above this in some more artificial testing, but this includes the storage, the alerting, the querying, all of a normal Prometheus, and we compress quite aggressively. The largest Prometheus instance we saw had 125 million active time series, and it was running.

Long-term storage: there are two long-term storage solutions which have actual Prometheus team members working on them, Thanos and Cortex. Thanos is historically easier to set up and run, but slower in querying and such; the initial point where Thanos started scaling was the storage. Cortex, on the other hand, is not as easy to run, though it has gotten a lot easier. It started with scaling ingesters and queriers, and then took the code of Thanos to scale the storage horizontally. If you are guessing that Thanos is planning to do the same with the Cortex scaling code for ingestion and querying, you would be right. As a pipe dream of my own, I would like to see "Cortanos" at some point. Maybe we will, maybe we won't.

So, what's new? Now we will see what you have missed if you have not upgraded Prometheus in the last year, and we see a lot of shops actually not updating their Prometheus instances very often, but let's have a look. The first point is service discovery. In the last year, meaning since, I think, July last year, we have added five new service discoveries to Prometheus, which brings the number of built-in service discoveries to 13 or something like that. We have added DigitalOcean, Scaleway, Hetzner, Eureka, Docker, and Docker Swarm.
And this is only the beginning: we know that we are getting at least three more new service discoveries in the very next release of Prometheus, and there is even more to come from the community. This enables you to use Prometheus in a lot of new use cases as well.

Then there is TLS and basic authentication. While Prometheus has always been able to scrape targets using TLS and basic authentication, it is now also able to serve its own metrics and its own interface using TLS and basic authentication, which means that you no longer need a reverse proxy if you want to secure your Prometheus instance. We have also designed and written a new exporter toolkit for your Go exporters, so that if you want to benefit from the same TLS and basic authentication work we have done for the Prometheus server, you can build your own exporter using that toolkit. This is what we are using for the official Prometheus exporters, like the node exporter, the HAProxy exporter, and the MySQL exporter. So we are using it ourselves, but we are also encouraging the community to contribute to it and to reuse it.

PromQL has also seen a lot of changes in the last year. First of all, we have added new functions, some of which were really wanted by the community, like last_over_time, which enables you to take the latest sample over a certain range. We also have new features to write better queries and to get more insightful data, like the @ modifier in PromQL. It is disabled by default because it can break some assumptions that caching proxies might have about PromQL. But basically, what it enables you to do is to evaluate a certain selector at a fixed date and time, which means that you can now see, over the last hour, the four containers that take the most CPU now.
And that is a change in behavior, because before, at each moment of the graph, you could only see the four containers that used the most CPU at that moment. The next feature is the negative offset. When you are querying your data in Prometheus, you can apply an offset to your data to say: I want to see the data, but one hour back. Now you can also do that one hour forward, which will help you debug and deep dive into your metrics a lot more easily. And then compound durations have also landed, which means that you no longer need to think: oh, how do I write one hour and 30 minutes? I need to write 90 minutes. No, you can directly write 1h30m. This will also ease the writing and reading of your PromQL queries.

We have enabled in Prometheus a remote write receiver. Remote write is a process by which Prometheus can send its metrics to a remote system. Now Prometheus can also receive remote write metrics. It is a different way to share metrics than the federation which exists in Prometheus, because remote write also passes on the stale markers and a lot of other information. Basically, if you want to run Prometheus on the edge, or if you have use cases that require more of a push model while still keeping all the promises of Prometheus, you can now do that. Prometheus still remains a monitoring system and still remains pull-based, but now you have new possibilities to move your metrics around, and that will definitely enable new use cases for our users.

The next one, and this is one of the most awaited and most exciting features we have seen, is exemplars. So what is an exemplar? An exemplar is a way to attach external data to a metric, which means that next to your metric you can now have, for example, a trace ID. This is the most common use case.
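As a sketch of what that looks like on the wire, an exemplar rides along with a sample in the OpenMetrics text format, after a `#` marker. This is illustrative Python, not exporter code; the metric name, counts, and trace ID below are made up:

```python
# Sketch of an OpenMetrics sample with an attached exemplar:
# name{labels} value # {exemplar_labels} exemplar_value
def format_sample_with_exemplar(name, labels, value, trace_id, observed):
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f'{name}{{{label_str}}} {value} # {{trace_id="{trace_id}"}} {observed}'

line = format_sample_with_exemplar(
    "http_request_duration_seconds_bucket", {"le": "0.5"}, 1042,
    "4bf92f3577b34da6a3ce929d0e0e4736", 0.337)
print(line)
```

The idea is that the bucket tells you "1042 requests were at or under 0.5s", while the exemplar points you at one concrete request, 0.337s, whose trace you can open.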
It enables you, when you have an actual alert or a dashboard, to directly see an example of a slow request, or an example of a fast query, and then jump from your metric to your traces. And the nice thing is that Grafana already supports exemplars, so you can plug your tracing backend, like Tempo or Jaeger, into your Grafana and jump from your metric to your traces. If the trace ID that you see there looks familiar, it is because it is directly taken from the W3C Trace Context specification.

And last, let's finish with the Alertmanager. We have two new features coming to the Alertmanager. One of them is time-based muting, which means that you can now decide, directly in the Alertmanager, that some teams should not receive some alerts during the weekend or out of business hours, and you can control that for each route. This is a nice addition; before, it required a lot of stuff happening in PromQL or in Prometheus upfront, and now you can do it directly in the Alertmanager. The second feature is negative matchers. You can now create silences that match the alerts which do not match certain labels. So you can decide, for example, to silence everything but production, to silence everything but something, which is also going to be helpful for the people who actually operate and are on call with the Alertmanager.

So what's coming next? What's coming in the future? The first and most important point is that we as a community are trying to be even more aggressively open than before. That has several different dimensions. Maybe the most obvious one is that historically, as I said initially, we have treated even experimental features and interfaces as functionally stable and immutable, which is great for stability.
But as Julien just said, it would not enable certain use cases. For example, if you have caching proxies in front of you, and you know you have caches in front of you and you don't ever want to break them, you are kind of locked in. So by being more willing to have experimental features which, A, change current behavior, and B, change what it means to be experimental, we want to open up more flexibility and more innovation. Or, in the case of remote write, where we literally just branded the experiment as the first version: we have this stable basis, and then we can innovate on top of it. A lot of our old assumptions are being revisited, and we are deliberately enabling more use cases. Julien was also just talking about this, with more service discoveries, and the exporter toolkit is also noted here. We try to accommodate use cases which are maybe not recommended by Prometheus, like agents, but which are still valid from the end-user perspective. Of course, while we would like every service to be able to run its own metrics endpoint, that is not the view which a usual enterprise security team will take. If you have, like, two dozen different ports open, which are not even contiguous, and they cannot tell from the outside which ports should be open on this or that machine, that tends to make people nervous. Putting all of this behind one single port, and then either using paths within that port or just pushing stuff through remote write, are totally valid use cases. So we are trying to enable more and more of those use cases within the Prometheus main org, instead of spreading this out to the wider community and, as such, creating differences in approaches. By upstreaming, or re-upstreaming, all of this, we hope to make it easier to reuse the code which we have, and just easier to use. We have a few design docs; the slides will be linked, and these are clickable.
So you can just read through them and give feedback. Another thing, and a large focus point of mine: imitation is the sincerest form of flattery, and Prometheus is the de facto system within all of cloud native and also beyond, which is great adoption-wise and great for the project, but it also means that at some point certain issues start to appear. If you look at the most current CNCF end-user observability radar (I titled that wrong; it is the current CNCF end-user observability questionnaire), you see that Prometheus and OpenMetrics are in "Adopt", in place one and place five respectively, across the whole CNCF end-user cohort, which is quite a statement. And if you look into the market, be it open source projects or closed source, be it near to Prometheus or super far away from it, there is a lot of interest in Prometheus and in parts of Prometheus. We want to ascertain that everyone who chooses to use any of those is actually able to use them within the Prometheus ecosystem as they expect. So this is an ask coming from the CNCF, from end users, from vendors, and from projects: to support this cross-testing, basically becoming more of a standard for cloud native observability. We already have two specifications out. One is OpenMetrics, which is the exposition format as you saw earlier; the other is the Prometheus remote write specification, which is how you can bulk-push data from, for example, an agent or a Prometheus server to long-term storages, or into certain pipelines where you can then handle your data and pass it onwards.
The nice thing about starting with those two: one is basically the interface to get any data out of systems and into a Prometheus-compatible system, and the second, once the data is already in a Prometheus-compatible system, is emitting all of it to whatever the consumer, long-term storage or otherwise, is. That also automatically covers both pull and push, which is super nice for a lot of operational reasons, with the guarantee that in between you have a Prometheus-compatible thing which actually does all the cleanups and everything you need to do for plain expositions: once to get it into a TSDB-compatible shape, and then to remote-write it onwards.

We have a variety of test suites already. We have a PromQL compliance test suite, a remote write compliance test suite, and an OpenMetrics compliance test suite. We are thinking about having TSDB and data-correctness test suites as well. We might even have more; that remains to be seen. It is also a little bit a question of how and who wants to have what, basically wherever we see usage of aspects of Prometheus. We want to make sure that all of this is done in a compatible way. We will also publish all of this on the main website, prometheus.io, where we have regular tests. As of right now, the rough intention is to have versioned tests. So you know you have 2021-4 or whatever, and that is the version of that particular test, and then it is valid for, I don't know how long, one, two, three minor versions of Prometheus, which translates to a few months. And then you can just run those tests, so everyone using this knows there is a certain expectation of which version range of Prometheus this thing is compliant with, or not compliant with. Yeah, and as to actually having a mark of approval or some such: the calls with the lawyers about the logo and everything are still outstanding.
And coming to all of those things which are coming: there is lots more in those documents, which are all linked, so you can just click through them in the slides. Everything which we do is, as of this year, recorded and open to join. We have always published our meeting notes and everything, but it was a remnant of this all being in-person that we didn't truly realize we could just make all of it online and public to join. We have always invited people to the in-person things, but that doesn't scale, whereas doing it online, it is super easy to just record and publish it. But it is also open to join: for all of those, just drop in, there is a calendar. We publish all of this on the YouTube channel. And I hope we were quick enough that we get lots and lots of questions; of course, we tried to optimize for basically giving you this in its three parts. And now for the actual questions, which we are very much looking forward to. Thank you very much. Thank you.