Hello, everyone. Welcome to KubeCon. I hope you're all having a great time. Welcome to this roundtable discussion on the evolution of metric monitoring and alerting via Prometheus. My name is Kunal, and I have been contributing to open source ever since my freshman year. Currently, I'm a CNCF intern working on Thanos, and I'm really excited for this particular talk. Let's just start by introducing the speakers that we have for today's panel. Why don't we start with Julius?

Hi there. I'm Julius. I co-founded the Prometheus monitoring system back in 2012 at SoundCloud. I spent the last years freelancing around it, just helping companies build stuff on top of it, use it, get training, and so on. This year, I created my own Prometheus-related company called PromLabs, so basically continuing what I've been doing, but also building commercial software on top. If you need that or any help with Prometheus, please reach out.

Yeah. Hello. My name is Björn. I have been with Prometheus almost from the beginning. That's a fun story, because I watched Julius and Matt Proud, his co-founder, creating it. I thought it was a very weird idea and nobody would ever use it, but they still lured me over. Back then, I was sitting in Berlin as one of the only Google engineers who could work from Berlin. Julius was there at SoundCloud; they invited me for beers and told me the story. At some point, I don't know if it was just Prometheus or just because they're nice people to work with, I came over, and that was 2013. That's long ago now. We started working on this, and I have never stopped working on Prometheus ever since. Now I'm here.

Hi, I'm Richard. I joined Prometheus in 2015. I was looking for a new monitoring solution for a network, for an ISP, but the data center people were already monitoring the data center with Prometheus. I basically fell in love with Prometheus immediately, because it's just more powerful and you can do math on your monitoring data, and I've been around ever since.

Yeah. My name is Bartek, and I'm a software engineer at Red Hat currently. We are doing all things observability, mainly on OpenShift. However, my journey with Prometheus started four years ago, in my previous job as an SRE, where we were just beginning to do rotations between teams. I was supposed to work on a game engine, and I ended up actually exposing metrics and doing observability with Prometheus, and I fell in love with it. That is how my journey started; then I became a Prometheus maintainer, and we started Thanos as well along the way.

Thanks a lot for the introductions, everyone. We'll be talking about what Prometheus is and how it started, so whether you're a beginner with Prometheus or someone much more experienced, there's something from both worlds you can get out of this. But before moving on to all of that, the first question that I have is: what actually is Prometheus, and how did it start?

So Prometheus is a metrics-based monitoring system that includes a time series database. It allows you to monitor software, devices, anything that you can really get numeric metrics out of, and then either make nice dashboards based on the collected data or also integrate that data into your alerts to wake yourself up at night (hopefully not at night) to detect and fix problems. And yeah, how it started.
So I found myself in my dream job at Google in 2012, being a site reliability engineer in Zurich, working to keep one of Google's services online and working. There at Google, we had a great monitoring tool called Borgmon. Everyone hated it, but everyone also said it's the worst tool except for everything else. I then left Google to go to SoundCloud in Berlin; ironically, I actually really liked the job at Google, but I really wanted to go back to my hometown of Berlin after so many years. At SoundCloud, another ex-Googler, Matt Proud, also joined, and we were both really missing the kind of monitoring tool that Google had with Borgmon. We looked at everything in the open source world back then and were really not happy with the data model, the UI, the query languages (or complete lack thereof), the efficiency of the storage, the ability to deal with dynamic environments. SoundCloud back then had already built its own complete cluster scheduler, before Docker existed, before Kubernetes existed, many years before that, and there were hundreds of microservices running on that thing with thousands of instances. It was basically impossible to find out what was happening when there was a problem, like a latency spike. Was it all of the instances, or just one? What's really happening in detail on that cluster?

So we were really convinced that we should at least try to build something inspired by Borgmon, to help with our own job, but also to create something for the open source world that we could use in our next jobs and that others could use. We basically started in our free time, and eventually, after it became useful enough, we started introducing it at SoundCloud. It was still a long path to actually make it work and convince people. I remember talking to Björn initially, I think in an elevator, and he was like: what, you're recreating Borgmon? This is totally crazy. Why would you do that? And I think, yeah, now he's happy. But that really helped, in the end, to get a totally different level of insight into what was happening with the microservices in that cluster, to get way more precise alerting, and to help teams work in a totally different way than before, like even testing their own software as it got released and so on.

Then we finally published it properly, with a blog post and everything, from SoundCloud and others in 2015. We didn't really expect anyone to really get what we were doing, because Prometheus does a lot of things differently from the monitoring systems before it, with its arcane-ish query language, a pull model instead of push for the actual data collection, and many other little things. But I think it really hit a nerve at that time, with people starting to adopt Kubernetes and needing something to monitor stuff running on top of it, and Prometheus integrating really well with dynamic environments via service discovery, which we might get into later. The rest of the story, basically: it got really large adoption from there. We were really happy. About a year later, we joined the CNCF, and by now we are a happy CNCF project. And for some reason it basically became the de facto standard in open-source monitoring, which is amazing, but was totally not what we set out to do initially.

That is in fact really amazing. And Julius mentioned something about time series.
So can someone give a bit more insight into time series for our viewers? And how does it compare to something traditional like Nagios?

Yeah, I guess I can take that. So Nagios was kind of the gold standard in, whatever, 2012, 2013, when Prometheus started. Many people are probably still using it, and it's doing a good job in certain traditional scenarios. But the thing is that with Nagios, you essentially run checks, and if they fail, you get an alert. It's this binary thing: it fails or it doesn't. About the only refinement is that you count checks, and if, like, three out of five fail, then you alert, or something like this. It has a bit of a time dimension, but there are no real time series in there, right? Time series existed a bit with StatsD and Graphite, which were fairly new at this time as well. They paved the road already, and kudos for that, definitely. But people always thought about dashboards, right? You draw lines onto a dashboard. This vision of doing it all in one and having alerting combined with it, that was a really, really important paradigm shift.

We usually tell people the cool thing is what's called trending, I think, where you can say: okay, I don't have to alert when I cross a threshold, when the disk is full; I can actually alert when it's getting full, right? So this allows me to alert way earlier if there's a steep increase in disk usage. And I can totally ride out, like, 90% disk usage: if it's just constant, it might be fine, right? So that's the one thing. There's also more you can do to make alerts more meaningful: if you alert on, say, a queueing system being behind, you can actually stop alerting when the queue is getting shorter, because it's still fairly long and that's bad, but you know the system is already recovering, so you don't need to alert.

So that was the one big thing. Also, with time series you get this whole notion of a counter. Let's say Apache: ancient Apache already had this kind of metrics module that would tell you how many requests it was serving per second and how many it had served altogether. That's essentially redundant information: if you just record in a time series database how this counter evolves, then you can take a rate, like differentiate the counter, and you get the rate, and you can actually decide whether you want the rate over the last minute or the last ten minutes. It's so much more powerful, but it requires you to record this as a time series. That's what Prometheus did, and it unlocked an enormous number of things you could do. And it was not just for drawing lines on a dashboard; it was also for alerting.

And the other thing is that we label those time series. That's where we went from those dotted hierarchical strings from Graphite to a label model, which is non-hierarchical, and you can slice and dice along all these dimensions. That's really important, as Julius said, to really drill down to certain failure cases or root causes.
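As a quick illustration of the counter-plus-rate and label ideas described above, here are two PromQL expressions; the metric, job, and label names are hypothetical examples following common Prometheus naming conventions:

```promql
# Per-second request rate over the last 5 minutes, computed from a
# monotonically increasing counter.
rate(http_requests_total[5m])

# The label model lets you slice and dice: the same rate, aggregated
# per status code across all instances of one job.
sum by (status) (rate(http_requests_total{job="api-server"}[5m]))
```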
And it was such a nice coincidence that Kubernetes also labels everything, right? Kubernetes came out when we were essentially already done with the Prometheus prototype, and the Kubernetes people could have noticed what we were doing with Prometheus, but they didn't know about it. It was an open repository, but nobody knew it.

And then it came out, and they are both Greek words, Kubernetes and Prometheus, both ten-letter words, so you could say K8s and P8s. The logos are in exactly complementary colors, orange and blue. So to anybody outside, it looked like this was designed on purpose to work together and to fit together, but it's not true. It's really convergent evolution, essentially.

And also, you can do a lot of Prometheus monitoring for non-Kubernetes systems. At SoundCloud, we did it for our own homegrown orchestration platform, but you can use it for everything. They really were separate developments, but they played perfectly together. And that's also, I guess, the reason why the first two projects in the CNCF are Kubernetes and Prometheus.

Absolutely. And yeah, go ahead.

I just want to make one shout-out to Zabbix, because even back then, it was doing time series. It was pull-based, at least mainly; you could also do push. It allowed you to do basic math. Still, Prometheus blew it out of the water. But already back then, you had something which had all the main ingredients, just not as nicely integrated as Prometheus did it.

Yeah, absolutely. That was a really good explanation of why Prometheus started and what it is. But let's say I have a Java application. If I want to use Prometheus, how do I actually send the metrics to Prometheus? How does that usually work?

And that's where Prometheus is very unique in this field, because you don't actually send metrics to Prometheus; Prometheus collects the metric values from your application. And this is the sometimes controversial discussion between the pull and push models. Prometheus is primarily a pull model here. It allows you to configure certain scrape targets, and a certain interval for how often to collect certain metrics. With a simple HTTP endpoint that you expose in each of your applications, you just point Prometheus to those endpoints, and Prometheus will collect the data from those endpoints periodically. This is essentially how Prometheus collects, at regular intervals, the values of the metrics that are exposed within your application.

That was pretty novel back then when Prometheus started, because everyone was used to something called blackbox, or closed-box, monitoring, where, for example in Nagios, you are creating scripts around your system that check from the outside what is happening within your application. Right now, there is this direction of exposing more observability signals from within your application. So maybe you can count your queue sizes, HTTP requests and server latency, and stuff like that, which is extremely important. And you do that by not pushing this data to some monitoring system, but actually just allowing some other system to collect it. And Prometheus leverages that.

This might mean that the code has to be instrumented: you need to use some client library, and there are plenty of libraries for the major languages, available and supported either by the community or by Prometheus maintainers. But in the end, this is much, much easier and more efficient than pushing metrics, because from the client perspective, the application doesn't need to worry about how to handle failover scenarios, rate limits, retries, or how to buffer this data. Essentially, you can keep your applications pretty much stateless and have reliable monitoring on top of that.
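Since the question was about a Java application, here is a minimal sketch using the official Prometheus Java client (the classic simpleclient API, assuming the io.prometheus simpleclient and simpleclient_httpserver artifacts are on the classpath); the metric name, label, and port are hypothetical examples:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class InstrumentedApp {
    // A counter with a "path" label, registered with the default registry.
    static final Counter REQUESTS = Counter.build()
            .name("myapp_http_requests_total")
            .help("Total HTTP requests handled, by path.")
            .labelNames("path")
            .register();

    public static void main(String[] args) throws Exception {
        // Expose the default registry on http://localhost:8080/metrics;
        // Prometheus pulls from this endpoint on its scrape interval.
        // A scraped line looks like:
        //   myapp_http_requests_total{path="/"} 42.0
        HTTPServer server = new HTTPServer(8080);

        // Wherever the application actually handles a request:
        REQUESTS.labels("/").inc();
    }
}
```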
This also allows you to simplify discovery and configuration, making it a kind of top-down discovery. And I think, Julius, you can tell us more about that.

Yeah. So once you have those applications with instrumentation, like an HTTP endpoint where Prometheus can pull metrics from, the next question is: how does Prometheus know where it should pull from? Obviously, the simplest way would be to just statically configure some endpoints in your Prometheus configuration, but that worked maybe 20 years ago, when you had these static database servers and a web server and they never changed. Nowadays, you have cloud instances popping up and going down, you have Kubernetes on top of that, and on top of Kubernetes many changing microservice instances. The key here is service discovery. Prometheus can integrate with different types of service discovery in your infrastructure. The most prominent one would be the Kubernetes service discovery, where it continuously talks to the Kubernetes API server to get a constantly updated view of what should exist in the world. For a monitoring system, it's really important to know what should be there and what actually is there. So Prometheus uses service discovery information (let's go with the Kubernetes example, but there are others) to know what should be there and how to pull from it, and then also to enrich the pulled time series data with information it got from service discovery. It will know which pod it's pulling from in Kubernetes, for example, which environment it's in, what the pod name is, et cetera. That really helps Prometheus deal with these dynamic environments, and it allows you to define alerts for when Prometheus is trying to pull from something that currently should be there but isn't, because Prometheus will notice that automatically. There are service discovery integrations built into Prometheus for the different cloud providers, for Kubernetes, Mesos, DNS, ZooKeeper, and some others, and there's also an interface to plug in your custom one. And if you are missing a built-in service discovery in Prometheus for something really popular, we might be able to include it in Prometheus itself.
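A minimal sketch of what this Kubernetes service discovery can look like in a Prometheus scrape configuration; the prometheus.io/scrape annotation is a common community convention rather than anything built in, and the job name is a hypothetical example:

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    # Prometheus watches the API server and keeps the target list current.
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated with prometheus.io/scrape=true.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Enrich every pulled time series with pod metadata from discovery.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```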
This exposition format also has a political angle, of course. While we have this system which allows you to nicely transport your metric data, it is inherently tied to the name of Prometheus. And we got hundreds back then, thousands these days, of integrations and exporters from people running their own stuff or instrumenting their own stuff and writing their own exporters; of course they needed it for their own things. Yet there was this political angle of competing projects or competing vendors not wanting to support something which carried the name of Prometheus. Which is why OpenMetrics was started, which is basically an effort to standardize the Prometheus exposition format with some slight changes. That concept has been in use for like five years now. We are actually, and I really mean this, close to publishing the actual draft within the IETF; just yesterday night we finished the to-do list for the final publishing.

So what is this? It's basically just taking the Prometheus exposition format, taking all that goodness, and putting it into its own thing, so you don't have that political angle of supporting something from a competing product, while also giving you an official standard within the IETF, where you can just say: okay, RFC such-and-such, please support this. That is especially important when you come to networking hardware or more traditional vendors; they usually work along IETF standards, at least in the networking space. So this is why I started this. It's just important to not only have Prometheus as this super nice, efficient, powerful data engine and framework for doing your observability, but also to permeate this concept of label-based metrics throughout the whole ecosystem, throughout this industry and other industries. That's basically the intention behind it. And I think it's kind of working, which is super nice. And again, within weeks we will actually publish the standard. No, really. Yay.

Yeah, that's really helpful. Bartek mentioned the pull versus push mechanism, so the question that I have, even from the viewer's point of view, is: how do I actually query the data and visualize the data that I have collected? How does that usually go?

Yeah, so there is a query language called PromQL in Prometheus, which really forms the heart of building alerts and building dashboards, but also of doing ad hoc debugging, digging around in your data. The query language was initially inspired by what we were used to with Borgmon at Google. It is, of course, not exactly the same, but it uses similar principles. It is not an SQL-like language like you would find in some other time series databases, but a more functional language, which just allows you to select data and then wrap more and more and more transformations around the selected data. I think one of the core features that really makes it powerful is that you can do math between whole sets of time series, for example dividing a whole set of error rates by a whole set of total rates, automatically joining them on either completely identical label sets or related label sets (there are modifiers you can use). So it allows a level of insight and math between time series that wasn't really seen before. Admittedly, it has some sharp edges and is different from what people are used to from previous query languages, but I think it really pays off, and it allows you to do very precise alerts as well.
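For example, the error-ratio math just described looks like this in PromQL; http_requests_total with a status label is a hypothetical metric, and the division automatically joins both sides on their identical label sets after aggregation:

```promql
# Fraction of requests that returned a 5xx status, per job and path.
  sum by (job, path) (rate(http_requests_total{status=~"5.."}[5m]))
/
  sum by (job, path) (rate(http_requests_total[5m]))
```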
Yeah, thanks, Julius. And what about visualizing the data that I have collected?

So there are several approaches to this. Obviously, you have the UI built into Prometheus itself, which is super nice for exploring your data and which I still use quite often because it's just very snappy and nice. But there was also something called PromDash, which was the dashboarding tool for Prometheus, and it was an absolute pain to use and to create dashboards with. I never liked it. Just by happenstance, around the same time, 2014 I think it was, Torkel founded or started the Grafana project, initially to visualize Graphite data, but there was a plugin for Prometheus as well within a few months, I think. And as I was mentioning, I was trying to find new monitoring for my ISP, for my backbone, and was looking at both Grafana and Prometheus at the same time, and I just kept using the two tools with each other while mainly ignoring PromDash, because it was just so painful. At some point, talking with Carl, I just realized: hey, why don't we try and make Grafana the actual default recommendation for Prometheus? Which we discussed. I don't think the Prometheus team formally existed back then, but within the group which was around Prometheus, we discussed this and basically decided that yes, this is not the focus of Prometheus anyway, and having something which takes care of the actual visualization part for fixed dashboards is absolutely nice, and it helps the project inasmuch as we don't have to focus on it. Of course, these days Grafana has the Explore UI, which basically mirrors and even enhances what you have in Prometheus proper, which doesn't mean that doing it within Prometheus isn't also fully okay and sometimes the quickest way to get something done. But for dashboards, basically, we switched the official recommendation to Grafana in either 2015 or 2016, I think. I think '15.

Yeah, thanks a lot, Richie. I do have a follow-up question on that. When we're using monitoring and alerting tools, one of the biggest advantages we can get is detecting the incidents that might cause damage. So do I need to stare at these dashboards to detect those incidents, or is there a better way to do that?

No, and that's exactly one of the primary goals of Prometheus. Prometheus was never only about dashboarding and graphing data; it's really about alerting and reactive work, focusing on running workloads only when there is something wrong going on. This is kind of a natural evolution when someone introduces monitoring. First, there is no monitoring. Then, for example, Prometheus is introduced, and they try to visualize some data, some health of the system. And then you set up alerts, so you can actually stop watching those health dashboards yourself; you can let Prometheus do the monitoring for you. This is really novel and amazing, and the way to go forward, because as an operator, DevOps, SRE, you can focus on feature work and expanding your business instead of operating and manually watching the health of your system.

So you can configure, let's say, rules that will trigger an alert, which is sent to Alertmanager, which is part of the Prometheus ecosystem. And Alertmanager allows you to route to the notification system of your choice. It can route to PagerDuty and those kinds of systems that send a message to your phone, or to Slack, email, JIRA, whatever. What is also revolutionary here is that you can actually trigger an alert on the symptom of a potential incident that may happen soon. For example, disk space is running low, or you predict that CPU saturation will happen soon because there is a steady increase and nothing is going down, or maybe memory utilization is constantly growing, so you predict that you will run out of memory soon. You can actually predict that ahead of time and essentially react faster. But the true value in truly automated infrastructure monitoring is SLO-based alerts. There's a huge ecosystem and lots of talks and tools around that.
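As a sketch, here is what both styles of alert just mentioned can look like in a Prometheus rule file: a predictive, trending-style rule on node_exporter's filesystem metric, and a deliberately simplified single-window error-budget burn rule for a 99.9% availability SLO (real SLO setups typically use multi-window, multi-burn-rate rules; the alert names, thresholds, and http_requests_total are hypothetical examples):

```yaml
groups:
  - name: capacity
    rules:
      # Trending alert: fire when the root filesystem is predicted to be
      # full within four hours, based on the last hour of growth.
      - alert: DiskPredictedToFillIn4h
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours."
  - name: slo
    rules:
      # Error-budget burn: with a 99.9% SLO the budget is 0.1%; a burn
      # rate of 14.4 over one hour consumes roughly 2% of a 30-day budget.
      - alert: ErrorBudgetBurnTooFast
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
          sum(rate(http_requests_total[1h]))
          > 14.4 * 0.001
        labels:
          severity: page
```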
I really recommend the tool that my coworker Matthias wrote. It lets you generate error-budget-based SLO alerts, where you define the service level objective that you care about. Because you don't care if maybe one user request fails; you care if a significant number of users cannot reach your service, for example for some amount of time that is maybe described in a contract. So you can adjust the alerting to match that contracted behavior, and only notify a human for response when it's really, really needed, to avoid noise.

Yeah, that answers my question. Thanks a lot for sharing. And I just have one more question, for Julius: what was the naming inspiration behind Prometheus? We know for Thanos it's the Marvel Avengers and stuff like that, and there's Loki; what was the inspiration behind Prometheus' name?

Well, to be honest, initially we just needed a name. So we did what everyone does: we went through Wikipedia lists of Greek gods and goddesses and all the Titans and all that. Eventually we stumbled over Prometheus, and first, it wasn't taken yet in the relevant space. We could even get the GitHub org; of course it existed already at the time, but it was not used, so GitHub actually gave it to us. But we also noticed that the symbolism fit nicely. For one, Prometheus stole fire from the gods and brought it to the humans, to the outside world, in a way. Well, I wouldn't say stole; I would say we were inspired by Google's Borgmon and built something like that for the open world. And the second meaning is the fire itself, right? A monitoring system gives you insight, as does fire: it illuminates things, and the torch is a nice symbol for that. So yeah, it worked out really well. And the fact that all these other cloud native tools also ended up with Greek names, like Kubernetes and Istio and so on, that was a complete coincidence.

Alrighty, yeah. Thanks a lot, everyone, for joining, and everyone who is watching this. We'll also be available to answer any questions, so if you have any follow-up questions, you can ask those. Have a great KubeCon. Thank you, goodbye. Thank you. Bye-bye. Bye-bye.