All right, cool. So what is Prometheus? Prometheus is a metrics-based monitoring and alerting stack. When I say stack, I mean end to end: you get the instrumentation libraries to instrument your applications, and also exporters to get data from other systems that are not instrumented with Prometheus. It has the scraping, which means the collection, and the storage layer. And it has a querying, alerting, and dashboarding layer, which is the UI and the way you interface with all the data that you collect. And this is for all levels of the stack: you can monitor systems and applications, from your hobby projects to the weather to anything. It's very generic.

What differentiates it from older monitoring systems is that it's made for highly dynamic cloud environments, for example Kubernetes, where pods come and go. Prometheus is really, really good for environments like that, where you don't have static hosts but have pods that come and go and change over time.

Prometheus does not do any logging or tracing. It's very focused on metrics. It does that one thing, and it does it really, really well. It doesn't do any automatic anomaly detection; you have to write the alerts yourself. We have a really powerful query language, as you'll see, but you have to write the alert queries yourself. We do not suggest any anomaly detection or anything. It also doesn't have scalable or durable storage. What does that mean? Essentially, with Prometheus you can comfortably store up to a few terabytes of data, because you can only scale to a single node. The typical retention is two weeks, and you can also store a few months of data, but this is typically not for storing years and years of data, or for scaling beyond a single node.

Yeah. So it started off in 2012 at SoundCloud. It was a bunch of ex-Googlers who had moved to SoundCloud and really missed the monitoring system they were used to, so they started working on Prometheus. They adopted it internally at SoundCloud at scale, and they finally published it at the end of 2015, I think. It joined the CNCF, and v1.0 was released in 2016. v2.0 was released in 2017. You might expect a v3.0 to come, but we are at v2.40 right now or something, so essentially we're still on v2. It works really, really well. v2 was a complete rewrite of the storage engine, and it's kind of future-proof.

So, coming to the architecture. You have your applications that you want to monitor, that you want to collect metrics from. The way you do this is you instrument your applications with Prometheus or Prometheus-compatible client libraries. These client libraries expose an HTTP server; essentially, you need to expose your metrics on an HTTP endpoint. You do not push your metrics to a central system. You just say: if you hit this API endpoint, which is typically /metrics, you get the full list of metrics. That endpoint is hit regularly by Prometheus, and that's how Prometheus collects the data.

Now, say you want to monitor things like a Linux system or MySQL, systems that are not instrumented with a Prometheus client library and are not exposing this endpoint, but that have their own way of exposing metrics. For example, with MySQL, you can run a bunch of queries and get metrics out, like the number of connections, the size of the DB, and things like that. Or for Linux, you can read the /proc filesystem to get the data out.
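To make that concrete, here is roughly what a scrape of such a /metrics endpoint returns, in the Prometheus text exposition format (the metric name, labels, and values here are made up for illustration):

```
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{path="/api/topics",code="200"} 1027
http_requests_total{path="/api/topics",code="500"} 3
```

Each line is one sample: a metric name, a set of labels, and the current value of the counter.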
So we have exporters, which you run alongside these systems. What these exporters do is basically go talk to the service in its own native language, in its own native format, take that metrics data, and convert it to the Prometheus format. So an exporter also exposes an HTTP server that talks Prometheus. And then we have Prometheus itself, which basically goes to all of these services, collects the data, and stores it in the TSDB, the local storage engine of Prometheus.

Now, Prometheus needs to discover what services exist, and that's where service discovery comes into the picture. This is something really, really powerful. Suppose you're running a collection of service instances; maybe you say, I want 45 of them to run. You have autoscaling, so that goes up to 47 or down to 43: you have a dynamic number of instances running. If you're monitoring them with a push-based system, you don't know when only 41 of them are running; it's really hard to detect that one or two instances are malfunctioning. But Prometheus talks to the Kubernetes API, or to whatever service discovery provider you use, so it knows: there are 45 instances running, and here are the IPs of all of them, so I'm going to go and try to get metrics from them. If one of them is down, we will know immediately, because the scrape will fail and Prometheus will have a metric (up) that says this instance is down, and we can alert on it. That's one of the advantages of the pull-based system over push. So that's the service discovery part of Prometheus.

And then we have the UI side of things. We have a web UI, which is really powerful. I still regularly use it, even though I work for Grafana; it works really well for debugging purposes. For dashboarding, for looking at regular dashboards, we recommend using Grafana. There are also a lot of people who have written a lot of automation on top of our HTTP API, where you just call Prometheus with PromQL and it gives you all the data. So that's the data usage side of things.

And we also have alerting. You can write alerts in Prometheus; Prometheus will evaluate these alerts against the data it holds and then send them to an Alertmanager, which deduplicates and routes alerts. You can say: for this namespace, send alerts to this Slack channel; for a critical alert, send it to PagerDuty; send all alerts to a Slack channel. You can do this routing; that's the Alertmanager component.

One thing I do want to mention is how robust alerting with Prometheus is, because of the very small number of dependencies it has. Networks are very prone to breaking, and if you put your alerting on a distributed system, or any system that relies on networking, it's less robust than putting it in Prometheus, as close to the data as possible. That's one really cool thing about Prometheus.

Yeah, the selling points of Prometheus are its dimensional data model (when the data model came out, it kind of changed the world; we will talk about that), a really nice and powerful query language built for Prometheus and Prometheus-style data, and a simple and efficient server. Even though it's a single-node server that typically doesn't scale horizontally, it really scales vertically on a single node. And also our very powerful service discovery mechanisms, to which we are constantly adding new ones and improving.
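As a minimal sketch of how that wiring looks in practice, here is a scrape config using Kubernetes service discovery; the annotation-based opt-in is a common convention, not something Prometheus mandates:

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod            # ask the Kubernetes API for the current set of pods
    relabel_configs:
      # Keep only pods that opt in via a prometheus.io/scrape=true annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

As pods come and go, the target list updates automatically, and any target that stops answering shows up as up == 0.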
So, talking about the data model: what is a time series? In my mind, a time series is an identifier for a list of timestamped values. For example, the price of a particular stock is a time series, because for each timestamp there's a value. Or the temperature in a particular city: there are a lot of different cities, and for each city there's a series, the weather for Detroit, the weather for Chicago, and so on. Those are time series.

But how do we identify each time series? That's where Prometheus shines, with its label-based identification of time series. For example, http_requests_total with labels that describe that particular series. It's flexible, and it doesn't have hierarchy: there are no dots here. It's not http_requests_total.nginx.200 or anything. Because there are no dots, you can add and remove labels without your queries and dashboards breaking, which is really powerful. In the older model, if you're used to Graphite or StatsD, you have something like http_requests_total.nginx.200, and you have to guess what the 200 is, you have to guess what nginx is. But here it's very explicit what each of those labels means. And that's really powerful.

For the querying, we have our own language called PromQL. It's a functional query language, built for Prometheus and Prometheus-style data. It's not SQL, which I think is a huge benefit. It's very easy to reason with, and once you get used to it, you cannot live without it. One really cool thing I want to mention: today, if you are a Google Cloud customer, you can go to Stackdriver, where before you had this clunky UI where you had to click a lot of things, and you can now use PromQL to query that data. It's really nice and really powerful, and more and more vendors are adding support for it.

Yeah, for example, let's see what selecting all the partitions that have more than 100 GB capacity and are not mounted on root looks like. First we have node_filesystem_size_bytes, which is a metric; if you select it, it gives you the capacity of each filesystem. And when I say not root, you say mountpoint not-equals "/". If you want a particular mount point instead, you just say mountpoint equals that particular path, and you get the data for that path. Now, this is in bytes, and we have to convert it to GB, so we divide by 1e9, and then say greater than 100. So it's greater than 100 GB, and it gives you a list of all the time series that match that condition. And here you can see: on this particular instance, this particular mount point is taking more than 100 GB; oh, it's taking 118 GB, and things like that. It's very easy to reason with, at least for the easier queries. You can write really ugly PromQL, but it's better than other languages, I think.

And here's "what is the ratio of errors across all my service instances?" You can just say: give me the rate of errors, divided by the rate of total requests, and that gives me the error ratio. Here it's like 2.9% of my requests are errors. And then you can also add custom groupings to it. For example, to get it by path, you just say sum by path on both sides, so you group by path and divide by path. And you can see: here's my worst-performing endpoint; about 9% of my requests are errors on this topics endpoint.
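Roughly, the two queries just described would look like this; the status="500" matcher is an assumption about how this example app labels its errors:

```promql
# Partitions with more than 100 GB capacity that are not mounted on root:
node_filesystem_size_bytes{mountpoint!="/"} / 1e9 > 100

# Overall error ratio across all instances:
sum(rate(http_requests_total{status="500"}[5m]))
  / sum(rate(http_requests_total[5m]))

# The same ratio, broken down by path:
sum by (path) (rate(http_requests_total{status="500"}[5m]))
  / sum by (path) (rate(http_requests_total[5m]))
```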
So you can write a query that says: if my error ratio is more than this threshold, send me an alert. Your alert will page you, and then you can take that query and dive into it; you can add dimensions, slice by dimensions, to understand exactly what is causing the problem. And alerting is very similar: you basically write an expression that matches a list of time series, and whenever there's a time series matching that alert condition, you can send yourself an alert. For example: if I'm seeing more than 5% errors, send me an alert.

But one really cool thing is the for period here, for five minutes. You can have some weird freak networking incident where you just have a blip of errors and then it's back to normal. That's fine; it's just a small, tiny blip of errors, and you don't want to be woken up in the middle of the night for that. You ideally want to be woken up only if it keeps going. So you can configure a for period, and only if the condition stays true for that amount of time do you get paged. That's really powerful, and it reduces the number of pages you get. And then you can add labels, route by labels, and do a lot of really cool things with alerting (there's a sketch of such a rule right after this).
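Here is that sketch of the 5%-for-five-minutes alert, in the Prometheus rule file format; the expression and label names carry over the assumptions from the earlier examples:

```yaml
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        # Fire only if the error ratio stays above 5% for a full five minutes,
        # so a short blip of errors does not page anyone.
        expr: |
          sum(rate(http_requests_total{status="500"}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: More than 5% of requests are errors
```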
Yeah, this is something that I'm also really proud of: it's very efficient. Even though it's a single-node system that's not distributed, a single node works for a lot of companies. It can scale to one million samples a second with tens of millions of active series, and this is more than enough for most organizations. You don't need distributed storage; you don't need a complicated monitoring system to run. You can just run Prometheus, which is single-node and extremely robust, and focus on your applications, not your monitoring. That's really powerful, and it's really simple to reason about and really simple to operate.

We also use Gorilla compression on our data, which means a 16-byte sample, a timestamp and a value, compresses down to just one to two bytes, like 1.2 to 1.6 bytes per sample on average. That means even with tens of millions of active series on a single node, you can keep two to three weeks of data, or even months, without having to resort to distributed or network storage. And that's really powerful. Some people keep years of data on it; you have to take regular backups and such, but it is possible. There's nothing inherently broken in Prometheus that stops you from storing more than this; it's basically just disk space that limits you.

Yeah, so one thing I want to highlight is the exporter ecosystem. Again, Prometheus is a pull-based system: it needs to go and scrape metrics from things, and those things might not be talking the Prometheus protocol. We have a huge ecosystem of exporters, and when I say huge, I mean massive. Every time somebody asks "how do I monitor this with Prometheus?", the first thing I do is Google the thing plus "Prometheus exporter", and I will almost always find one, including a speedtest.net exporter, or a router exporter for the FRITZ!Box back in Germany: it talks to the router's API, converts things to the Prometheus format, and gives me data on how my router is performing. The ecosystem is huge, which means you can monitor most of the popular services and things that you run with Prometheus. So that's the exporter ecosystem.

We also have the JSON exporter, which is very generic: if an API returns JSON and you want to convert that to metrics, you can do it with the JSON exporter. But you can also natively instrument a lot of applications and libraries with Prometheus client libraries, and this is something we are seeing more and more of.

Cool. In conclusion: Prometheus, with its dimensional data model, its query language, and its simplicity, is a really, really good monitoring system for cloud-native environments, and also for your hobby projects, for everything across the stack. Use more of it if you are not already using it. So that's me with the intro, and now over to Ganesh.

Yeah. So in this deep dive, we will look at a few highlights of what's new in Prometheus in the last year. We'll mostly look at the Prometheus server itself, though there are lots of other projects under the Prometheus org which have done great stuff. And we'll see what's coming next.

Recapping some features of PromQL which have been presented before: two things that let you control the time of a query. A query works in a way where you specify the time at which you want it evaluated, but you don't always want all the data to be fetched for that same time. Sometimes you want to fetch data from the past, sometimes from the future, and sometimes you want to pinpoint the exact time for a selector. Prometheus has long supported the offset modifier, which lets you move the evaluation time back into the past. In version 2.25 we added support for moving into the future too, with negative offsets like `my_metric offset -1h`, because sometimes people like to forecast things and write that data into Prometheus, so you can query data in the future. And with the @ modifier, as in `my_metric @ 1609746000`, you pin that vector selector to one particular evaluation time. These were introduced more than a year ago, but we declared them stable within the last year. And recently we also added trigonometric functions to Prometheus, plus the atan2 binary operator. In the first example on the slides, the angle is in degrees, so we convert it into radians using the rad function and then take the sine of it: `sin(rad(angle_in_degrees))`. Yeah, those were the new things in PromQL.

Now, before I talk about this particular feature: Prometheus has a feature called remote write, where you can configure Prometheus to send all of the data it's ingesting to a remote storage system, for example Thanos or Cortex, for long-term storage. And in version 2.25, Prometheus gained the capability to receive data via remote write as well, so you can have one Prometheus remote-write into another Prometheus. This also came more than a year ago, but in version 2.33, which was within the last year, we declared it stable. I think you still have to enable it with a flag.

And talking about remote write: you don't always want to store all the data in Prometheus locally, because it's possible you never query that Prometheus. So in version 2.32 we added a new mode called agent mode, where Prometheus does not store any data in a queryable fashion on disk, which reduces the load and the storage it requires. It just scrapes the data, stores it in a write-ahead log, forwards it to the remote storage, and truncates the write-ahead log regularly. This was added to make remote-write-only setups easy.
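As a sketch of both features, assuming a hypothetical remote endpoint URL: forwarding is configured under remote_write, and agent mode is a server flag:

```yaml
# prometheus.yml on the sending side: forward all scraped samples upstream.
remote_write:
  - url: https://remote-store.example.com/api/v1/push   # hypothetical endpoint

# The receiving Prometheus is started with the receiver enabled:
#   prometheus --web.enable-remote-write-receiver
# A write-ahead-log-only sender runs in agent mode:
#   prometheus --enable-feature=agent --config.file=prometheus.yml
```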
Until recently, Prometheus released a new version every six weeks, and once a new version was released, we rarely ever backported bug fixes to older releases; we always asked you to upgrade to the latest release if you wanted a bug fix. That was not really helpful for everyone. So starting with 2.37, which was released in July, we have started doing long-term support releases. The first one is 2.37, and we plan to support it until January. I think the plan is to have one or two long-term support releases every year, which will get critical bug fixes.

Another shortcoming, before this next feature: if for a time series you got a sample at, let's say, timestamp 1000, and then for the same time series, due to some timing issue, a new sample came in at timestamp 900, Prometheus would reject it, because it did not support timestamps arriving out of order. Every new sample had to be newer than the sample before it. But starting with version 2.39, we added experimental support in Prometheus where you can say: if an out-of-order (old) sample arrives within a configured window, let's say two hours as in this example, ingest it. So Prometheus no longer has to reject out-of-order samples, and you could configure it to accept, let's say, a hundred-year-old sample and it would still work fine. (There's a configuration sketch right after this part.)

And this is the last thing I'm going to talk about in the deep dive, and the one I am most excited about. In the data model, you might have seen that a sample has a timestamp as an integer and a value as a float64, and we use this time series model to represent a histogram. A histogram is a distribution of values across buckets, for example 100 observations between zero and one second, and so on. For every bucket we had a separate time series, plus separate time series for the overall sum and count of the histogram. But with native histograms, which I just merged yesterday and which will be part of release 2.40 next month, we are replacing the float value with a complex data structure that stores histograms in a sparse, nicely compressed fashion, so a histogram becomes a single time series that can have hundreds of buckets without compromising on storage efficiency. So this is coming next month.
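Here is that out-of-order window sketch; this is experimental as of 2.39, so the exact configuration may change:

```yaml
storage:
  tsdb:
    # Accept samples up to two hours older than the newest sample in the TSDB.
    out_of_order_time_window: 2h
```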
So with this, I would like to open the floor for Q&A. I think we have roughly 13 minutes.

Hi, thank you for the explanations, all of them. I'm a happy Prometheus user, and recently I found the kube-prometheus-stack Helm chart and deployed it. One of the secret names is a little bit strange, so I opened an issue, but nobody has checked it. I'm really sorry, but nobody has seen it; the issue number is 2533. Could you look at it if you have time? Thank you, and sorry.

Yes. Also, while I walk back there: I happen to be a maintainer of the kube-prometheus project, and the Helm chart is not maintained by the maintainers of kube-prometheus. Just wanted to say that.

Hello. I had an issue in production a little while ago where a developer wanted to add some custom metrics to their app, which is, yay, fantastic. He didn't really have a lot of oversight when doing it, so he added a label for the path of the incoming request. The problem was, some of the paths had unique IDs in them. This went to prod, and after about a week we ended up with a label that had a cardinality of like 500,000. As you can imagine, Prometheus did not handle that well. Well, actually, I'm not sure if Prometheus didn't handle it well or the cluster didn't handle it well. Maybe there's a simple answer I'm not aware of: is there a way to put some guardrails around labels and high cardinality like that? Or is it just a "hey, don't do that again"?

So we have something called relabeling rules at scrape time, but for the issue you mentioned, I don't think you can foresee it happening. So we now have scrape-level limits, where you can limit the number of series in a scrape, and you can say: if somebody is exposing more than 5,000 or 10,000 series, depending on the scrape config, the scrape will just fail. And then the developers will come to you and say "I can't see my metrics", and they can go to the targets page and see: oh, I'm pushing a lot of data, probably garbage. So we have all those limits in place. They're not part of the global config, so you have to add them to each scrape config, but I have a PR out, which I did during the ContribFest today, that is going to add them at the global level as well.

Sorry, what was that called again?

I don't remember offhand, but if you go to the scrape config documentation and search for "limit", there will be a bunch of limits in there.

Great, thank you.

It's sample_limit.
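A sketch of those per-scrape guardrails; the values are arbitrary examples, not recommendations:

```yaml
scrape_configs:
  - job_name: my-app
    # Fail the whole scrape if the target exposes too much, instead of letting
    # one unreviewed label blow up cardinality.
    sample_limit: 10000
    label_limit: 30                # maximum number of labels per series
    label_value_length_limit: 200  # maximum length of any label value
    static_configs:
      - targets: ["my-app:8080"]
```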
Is there any performance trade-off with using the out-of-order write feature that you mentioned? Does it use more resources because it has to cache things in memory?

Yeah, so we have tested this in production with a bunch of different loads. Depending on the rate at which you are getting out-of-order samples, there will be a bit of an increase in CPU. Memory, I don't think it's affected much, but again, it all depends on the percentage of samples and the percentage of time series that are getting out-of-order samples, and the rate at which you're getting them. Still, the memory is capped at some point, because once a time series accumulates 30 out-of-order samples, we just flush them to disk and cap the memory consumption of that particular series. So you have to just try it out and see how it performs in your environment. There can be no difference, there can be a somewhat higher cost; it depends.

Thank you.

So when you're talking about the long-term support, the LTS, you're referring to specific versions, right? Is it that only those versions get the critical bug fixes? I was hearing something like that.

Okay, so the latest release which is out will always get critical bug fixes, and the long-term support release, for example 2.37, which went out in July and which we plan to support until January, will get critical bug fixes until then, until the next LTS release is out. On top of that, the latest release, which right now is 2.39, will also get critical bug fixes.

So it is like a patch release? Yes, a patch release. It won't get new features, and it won't get experimental features or bug fixes to those, but it will get critical fixes.

By the way, we also have a quarterly Prometheus community call where we discuss all the latest and greatest things, and you can always bring your questions and talk directly to the maintainers and community members. Just Google "Prometheus community call" and you'll find it. And we also have monthly dev summits where we discuss all the development happening in Prometheus and openly make decisions about what we want to do next. And if you have anything on your mind that you want to propose, the doc is public, so you can also search for "Prometheus dev summit" and take part every month.

Hey, it's been a while since I last checked, I think it was last month, but I ran into an issue, and I'll give some backstory here. If anyone has run Istio and tried to scrape Istio metrics, you know that the cardinality is a little ridiculous, especially when you're doing a pod monitor with sidecars, right? We run around maybe 300 clusters, and our setup is: Prometheus on everything, and everything remote-writes to Mimir. And kudos to Mimir, I hadn't been able to break it until this happened. What essentially happened was, it was my fault, because I misconfigured the remote write config: I didn't allow it enough capacity to write out. Basically it was scraping too much and couldn't keep up with the load, because my max shards setting wasn't high enough; I did some testing, and the number of shards it requested was like 30 million, right? Because of this, it wasn't writing in time, data was getting out of date, and that caused a cascading failure across the Prometheus pods in all of my clusters. So now all of the pods are trying to send these metrics and failing; Mimir is trying to ingest the metrics, but it's returning, I can't remember the error, it goes like 490 or something like that. And basically this is the only thing that has ever taken down Mimir for me, which is a surprise. But my question is: is there a possibility to add logic to the remote write config so that if data is out of date for a specific amount of time, it just doesn't even try to send it and drops it right away? I feel like that's the only solution to this particular issue, and I'm lucky because I don't really care too much about the data on the pods, so I'm just using emptyDirs. What I did was run a script that deleted all of the pods and had the StatefulSet start back up with a fresh write-ahead log. But if I were running PVs on there, we would still run into the same issue, and it would be a huge problem. I remember there being an issue about this, and I haven't checked in about a month and a half, but is that being looked at, or is it a possibility that could be implemented?

Yeah, I mean, the first thing is you can have an alert that says: if I'm out of sync for more than 60 seconds, page me. That's the first thing I would do. And I would run alerting on Prometheus, not on Mimir, because if Mimir doesn't get the data, you can't alert on it; it's more robust to always do alerting on Prometheus. That's one layer. And yeah, I think you can do a few things, and we want to improve this in a lot of different ways, but I would have that alert and act on it directly, and hopefully not rely on an architecture that causes a cascading failure.

Right, yeah, and if anyone's curious, the fix was increasing the capacity and also the maximum number of shards, so it didn't fall behind and cause a cascading failure.

Yes, and increasing the batch size also helps, in my experience. Remote write may work out of the box for 99% of the use cases, but for those remaining cases you need to tweak and tweak. Maybe we can do some auto-tuning eventually, but for now we're going to stick with what we have.
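For reference, these are the knobs from that story; a sketch of remote write queue tuning with illustrative numbers and a hypothetical endpoint:

```yaml
remote_write:
  - url: https://mimir.example.com/api/v1/push   # hypothetical endpoint
    queue_config:
      capacity: 10000             # samples buffered per shard
      max_shards: 200             # upper bound on parallel sending shards
      max_samples_per_send: 2000  # batch size per request
```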
Okay, yeah, and another question too. When it comes to the scrape configs themselves, there are the relabeling configs and the metric relabeling configs, but there's also the remote write relabeling config. When it comes to ServiceMonitors and PodMonitors, it would be nice to have the best of both worlds: store some data locally, but for the stuff that's really heavy, memory-usage-wise, send it on without storing it locally. Kind of like how you use metric relabeling to keep certain metrics, except today you'd have to stop remote-writing certain metrics at the global level. If we could put that into another section of the PodMonitor or ServiceMonitor, like a remote write relabeling or something like that, I think that would be really useful. Are there any plans for that, or is it just going to stay at the global level?

I don't think there are any plans for that, but it's a feature request that I haven't seen, so just open an issue for it.
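For context, here is a sketch of the global-level filtering that exists today: dropping a heavy metric from remote write while keeping it queryable locally (the metric name and endpoint are placeholders):

```yaml
remote_write:
  - url: https://remote-store.example.com/api/v1/push   # hypothetical endpoint
    write_relabel_configs:
      # Drop this expensive series from what gets forwarded;
      # it remains queryable on the local Prometheus.
      - source_labels: [__name__]
        regex: istio_request_duration_milliseconds_bucket
        action: drop
```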
We still have a few minutes, but let's try to keep the questions short. We will always be around, all the Prometheus people, so you can also come to us and discuss afterwards. So yeah, we have a couple more questions, I think, and a few more minutes.

Hi, this is relating to cardinality. I guess there's no support for string metrics, where the value is a string, right? Everything else is a label. Today, if I want to keep track of a string, I have to put it in a label, which increases the cardinality. Could we add string metrics?

It's again a feature request; I'm not sure it will be on the roadmap anytime soon, but we'll consider it. We are a metrics monitoring system, not a logs monitoring system. But yes, with the native histogram work, the sample value could become a little more generic. Maybe in the future we'll support it, but there are no plans as of today.

Thank you.

Hey, I've heard about some updates on the PromQL language server, as well as another project called PromLens. Some difficulty I've noticed folks having is with learning PromQL, particularly our developers, and I was wondering if there are any plans to integrate that with the Prometheus UI, or with Grafana if it's not already there. Looking to hear if there are any status updates on that.

By the way, PromLens was open-sourced, I think two days ago, in collaboration with PromLabs, which is Julius's company, and Chronosphere, and we have full plans to integrate it into Prometheus. It's in the Prometheus org right now, and you should see it in the coming months.

Cool. Any more questions? Yes, we have a couple more. Yeah, one more minute. Who was first? One at the back.

Sorry, I know you didn't really mention the subject and I'm not really up to date, but could you elaborate a little more on native high availability in Prometheus: what you specifically don't want, and what your roadmap is?

So Prometheus is highly available in the sense that you run two Prometheus servers that don't talk to each other, that don't know anything about each other, and you do deduplication of alerts at the Alertmanager level, which is itself highly available. So I would say Prometheus does solve the high availability problem in its own way. It's a very workable solution that works for most people; Prometheus is inexpensive enough that you can run two Prometheus servers and be happy about it. If you go to a really distributed, highly available system, you'll probably have to run three replicas that talk to each other; there's a network dependency, there are split-brain problems. We don't want to go into all of that. We want to build a very robust server for alerting, and we don't want to complicate the architecture. But if you do want a scalable, highly available system, there are other projects in the ecosystem, like Cortex, Thanos, and Mimir, that you can look into. Prometheus itself will remain a single-node system where high availability means running two copies of it.

Thanks.

Again, even for customers of Mimir, I always suggest running alerting in Prometheus, mainly because it's far more robust than running it in Mimir. Like they mentioned, it's common for the data to lag behind; when I say common, it happens maybe once in three or four months, but even that is too frequent if you're relying on it for alerting. Yeah. Do we have time for one more? Yeah, I think so. Okay, we have time.

You talked about scraping as the only way to get metrics into the TSDB. What about things like OTLP, the OpenTelemetry protocol?

So, Ganesh also talked about the remote write receiver: you can push remote write into Prometheus. And there's the OpenTelemetry Collector; I gave a talk on Monday, it should be up already, I think. Basically, you can send OTLP to the Collector and have the Collector remote-write into Prometheus. So you can convert OTLP data into Prometheus data, and it maps really, really well (there's a sketch of such a pipeline below). Having said that, I'm also part of the OpenTelemetry community, and I keep proposing adding native OTLP ingestion. There's a dev summit in a couple of weeks where I'm going to propose it again, so let's see what the Prometheus community says. But today, if you want to do that, you can do it through the Collector; there's full support and it works really well.

All right, I think that's time. But we will be here, and there are other maintainers here too, so you can always come talk to us.
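For the OTLP question above, here is a sketch of such a Collector pipeline, assuming the prometheusremotewrite exporter from the Collector contrib distribution and a placeholder endpoint:

```yaml
# OpenTelemetry Collector config: receive OTLP, forward via remote write.
receivers:
  otlp:
    protocols:
      grpc: {}
exporters:
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/write   # placeholder
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```

The Prometheus side would need its remote write receiver enabled, as mentioned earlier in the deep dive.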