So, if you have any questions about Prometheus and/or Grafana, this is now a Q&A session, because reasons. Does anyone have questions? Sure. Testing, testing. So, any questions for Prometheus?

As a user of InfluxDB: can you give a short explanation of the difference between InfluxDB and Prometheus?

So, the question is the difference between InfluxDB and Prometheus. Let's take it. I guess the main difference is that Prometheus is a full monitoring system which basically covers everything end-to-end. We have instrumentation libraries that you can use for instrumenting your own code. Then we have exporters for a bunch of different tools out there, like MySQL. So that's the instrumentation side; that's covered by Prometheus. And then we have Prometheus itself, which is a monitoring system that includes a time series database as well as the actual collection. So it can talk to service discovery systems to detect what is out there to be monitored and then fetch the instrumented metrics, right? And then you have the query language on top, and you have the alerting on top, and you have the alert management. So basically the whole chain from instrumentation up until graphing and alerting is covered by Prometheus, whereas InfluxDB is first and foremost a time series database. Of course, they also have tools, but they are more stand-alone, I would say.

So I would say, if we're going to try and compare apples to apples, we need to compare Telegraf, InfluxDB and Kapacitor to Prometheus' exporters (not the client libraries, because Influx doesn't have those), then Prometheus, and then the Alertmanager, because that's the fair comparison: those are roughly the same things. And there are architectural differences and so on. The way I see it: Prometheus is a metrics system. You just saw my talk, those of you who were here: metrics give you breadth, not depth. Influx, I see, is actually an event logging system, so it's a log system. It's actually also fairly good at doing metrics; it can ingest Graphite data, for example, but it fundamentally is an event logging system, not a metrics system. So that's kind of where I see the difference. And then Kapacitor is fairly powerful, but for all the things people say about Prometheus, that it doesn't scale horizontally: it turns out Kapacitor uses the exact same way of scaling out that Prometheus does, which is that you get to vertically shard, which I find amusing. So that's the way I see it. If you're looking at something like, say, IoT, the obvious choice there is Influx, all other things being equal, because that tends to be event logging. If you're looking for a general monitoring system, I would suggest Prometheus, but I am a Prometheus developer. That's the way I would see things. Next question.

About alerting in the latest Grafana: I want to have one templated dashboard for many, many servers, and when I receive an alert notification, I want to see the host name. Which host is alerting? Is it possible to make...

I don't think I quite understand the question. I think the question is: he has a lot of dashboards, and when he has alerts, he wants to know what host name created the alert. So, alerts in Grafana right now only support series names, and not tags like server, node or something like that. Eventually the alerting will also support tags, so you can include node or series name or data center or something like that. But right now you would have to format the legend to contain the server name or data center or something like that. I'm not sure if that answers your question. You could also use a wildcard query, say, in Graphite you would place a star where the server is, and then only the series violating the threshold will be included in the alert. So if you have 10,000 servers and two of them have high CPU or something that you want to alert on, then only those series will be present in the alert.
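To make that concrete, here is a minimal sketch for a Graphite data source, assuming a hypothetical metric path of the form servers.<host>.cpu.load: the wildcard covers every server, and aliasByNode renames each series after its host segment, so the offending hosts show up by name in the alert notification.

```
# Hypothetical Graphite target for a Grafana alert panel.
# The * matches every server; aliasByNode(..., 1) names each series
# after the host segment, so violating hosts are identifiable in the alert.
aliasByNode(servers.*.cpu.load, 1)
```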
I have a question regarding Grafana alerting as well. What is the roadmap for the development of distributed alerts in Grafana?

The roadmap will be presented later. Yeah, I don't want to give away bits of my talk right now, if that's okay. We should also say that there are other monitoring systems represented here as well; this is the monitoring track, not the Prometheus and Grafana track, so come on down.

InfluxDB has a pretty weird retention schema, and Grafana doesn't work with these retentions in InfluxDB. How will you solve this problem when you want to make some aggregation of all the metrics and see them in the same dashboard, instead of creating different dashboards for different retentions?

So the question was about retention policies in InfluxDB and in Grafana, right? We have been talking with the InfluxData team, and we have not found a common way of making that configurable. So right now you would have to configure each dashboard for certain policies, and that's very cumbersome. You could use interval templates or something like that for the policy selection, but there's no automated way.

Hello, yeah. What's the ideal retention for Prometheus? I heard one time that it's only a few days or a few weeks, I don't know. Is there unlimited retention for Prometheus, or do we need to put in place a strategy to back up the data?

So the question is Prometheus and long-term data storage, basically. The thing is that Prometheus is designed for reliability, which is different from durability. And basically, if you want to have infinite retention, that means you're creating a distributed storage system, which is really, really, really hard and, in the context of monitoring, unreliable. Because you want a monitoring system that keeps on working even if your network starts falling apart. So our approach is to see the storage in Prometheus more as a cache, which might be a few weeks of cache, and then there's something else which holds the data long-term. Our strategy, because basically we don't have time to build one of these ourselves, we're busy with other stuff, is that we will have interfaces out. So there's already the write path, and it's been there a few months now. Prometheus can write out to another system, and at the moment there's Weaveworks Cortex, which is open source, that can accept this data. And there are other systems as well which can build on that. And we're also, at some point, going to add the read path as well, which can read the data back in, so you can transparently access your data. And then you can basically have a separation of concerns where all your alerts are based on the cache, which is all local, all inside Prometheus, all reliable. And if you want to go back more than a few weeks or whatnot, you can have the relatively unreliable query to your long-term storage. And if that's broken, well, you've just lost graphing until you fix that, but you haven't affected your core alerting abilities. But the idea is that this will be a separation of concerns between Prometheus and the long-term storage, because we don't want that sort of CP system with strong consistency when we care about availability for, well, alerting. Because I shouldn't use the word monitoring after my own talk. But alerting and critical dashboards, not trending.
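As a rough illustration of that write path (a sketch only; the exact configuration has changed between releases, and the endpoint URL here is a placeholder for whatever receiver you run), remote write is configured in prometheus.yml along these lines:

```
# prometheus.yml (sketch): forward samples to a long-term store such as
# Cortex. Alerts keep evaluating against the local storage regardless.
remote_write:
  - url: "http://cortex.example.org/api/prom/push"
```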
Next question? When I bring the mic over, you can just raise your hand, because that saves time. And just to be clear, we have to get the projector working again; since we have the time slot, we need to use up the time.

If you have long-term metrics storage, don't you have to build a system for analytics? Because long-term storage without large-scale analytics does not make sense, no? Can you repeat the question a little closer to the mic? Ah, okay. Long-term storage without mass analytics, with a MapReduce framework or something, doesn't make sense, because, okay, you fill the hard drive, but you don't have the capability to retrieve and analyze it efficiently, no?

I think the question was that only having long-term storage without having the analytical capability of Prometheus wouldn't work. And the answer is, you shouldn't own... Is this good? Can you please stop talking or leave? Thank you. So basically the question was: if I write out over this remote write path, how do I read back the data? Was that the question? Okay, so basically the answer right now is that we have remote storage APIs out to arbitrary systems, but there's no read-back API yet at this point. So you can build your own remote storage integration, but there's no generic read path at this point. There will be, hopefully within this year.

Is the long-term goal to work side by side with traditional monitoring and alerting systems like Icinga, or is the long-term goal to replace these tools?

I would pretty clearly say it's the goal to replace these, yeah. I mean, I think it's mostly focused on checks and alerting, so my talk kind of goes into this: what our approaches are to solve this problem in a different way. So if you come around later for this talk, you will see more. So I have to see your talk, okay? Yeah, I guess.

Does your talk also cover multi-data-center application monitoring?

I mean, in theory, you will see that the concept is kind of agnostic to that, right? So it would totally work. In one sentence, the basic idea is that we collect time series data, so we have sort of our state of truth in terms of sample data, and then we can define alerting on top of that; where we collect this data and how we collect it is completely separated from the alerting layer. That's the short story. The alerting layer is what's mostly interesting here, because what you have is multiple data centers hosting one application, and then how do you alert on what's going wrong there? Yeah, I mean, Prometheus is collecting time series data, right? So Prometheus can monitor multiple data centers. There are patterns for how to set it up, and then we have time series data, this is completely separated from it, and on top of that we define alerting. So as soon as you have a monitoring system that can collect time series data in multi-data-center layouts, you can also alert on it. The thing with Prometheus is: if you can graph it, you can alert on it, and you can certainly graph multi-data-center metrics like latency and so on. There are some reliability questions about, you know, being tightly bound to those data centers being up in order to be able to do alerting. So I would generally advise, as much as possible, pushing the alerts down to the data center level to avoid the WAN links being on your critical path for alerts. But if you want to alert on, say, global latency, you can. Just be aware of the reliability implications: depending on how you do that, you might get no alerts if your network's dodgy.
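A minimal sketch of what "if you can graph it, you can alert on it" looks like in practice, using the newer YAML rule format; the metric name, aggregation and threshold are made up for illustration:

```
# rules.yml (sketch): any expression you can graph can drive an alert.
groups:
  - name: example
    rules:
      - alert: HighCrossDCLatency
        # hypothetical latency metric, averaged per data center
        expr: avg by (datacenter) (request_latency_seconds) > 0.5
        for: 10m
```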
What's your question in the middle here?

So extracting metrics from logs obviously isn't ideal; to your earlier point, logs are logs, metrics are metrics. But in terms of support for doing stuff like that anyway, does this support tools like Logstash and that kind of stuff as well?

So Logstash is just a transport layer, as Fluentd is; it moves data from A to B. As I said in my talk, logs are great for depth, not breadth, and if you try to get breadth times depth, you end up with too much information. But for Prometheus, two tools have been written that can do that: the Grok exporter and mtail, Google's mtail. And Fluentd might also be getting support to extract some metrics from logs. Because my stance on it is: if you want logs, use logs. If you want metrics, use metrics. If you only have logs, well, at least try to get some metrics out of them so you can fix things to be proper later. Because, let's be honest, the world isn't perfect; you need tools like these sometimes. And the only way to discover disk failures is still actually logs. We've checked: at least when I looked two or three years ago, Linux did not export metrics to let you know if your disk has failed; your only choice is looking at syslog. You know all those messages when your disk is failing? That's how you find them. Only way.

Another question: you were saying you'd like to replace traditional systems like Nagios, or the more important ones like Icinga. If you have an environment which is absolutely not cloud native, and is not even distributed over a lot of systems, but consists of industrial controllers, robots, whatever boxes, which can be checked with a small script very easily: how do you think you can replace all the knowledge that has been built up over the last years? Can you write an exporter for everything?

Rich, do you want to answer that one? As someone who's not working at a cloud-native company: yes. If you have caching concepts in your monitoring, where you basically have some nodes which cache for higher-level monitoring nodes, you have to take that into account, but it works. You just have to plan for it accordingly, but yes. So even without anything dynamic, or even with really old legacy applications, we see huge benefits with Prometheus and Grafana. Huge.

So, a little secret is that Prometheus doesn't know what a container is. Prometheus does not know what a cloud is. Prometheus just has labels, which you put your own ideas onto. So if you want to monitor actual hardware, you can. Like, I've got a Prometheus running at home; it monitors three machines and a switch. Don't ask why I've got a 48-port switch at home. But you can do all these things.
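As a minimal sketch of that, assuming node_exporter running on each machine (the hostnames and the label here are made up), a static scrape configuration with an arbitrary label might look like this:

```
# prometheus.yml (sketch): plain static targets, no cloud or containers.
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['machine1:9100', 'machine2:9100', 'machine3:9100']
        labels:
          site: 'home'   # labels mean whatever you decide they mean
```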
One thing to watch, though, is that the trade-offs we make are engineering trade-offs around on-call, where losing a little bit of data is better than losing monitoring. If you're talking industrial control, like actual industrial control, I would evaluate Prometheus very carefully to see whether the trade-offs we've made, which make total sense when talking about web servers or network switches, also make sense in an industrial control setting. Because we have made trade-offs there in terms of precision, just to get better availability.

Do you have any idea how memory-intensive Prometheus can get? I have a VPS with a pretty limited amount of memory, and I was pretty surprised that with about 100 lines in the metrics file, it started eating like two or three gigabytes of memory; that seemed quite a lot.

I've researched this recently. So the default for Prometheus is about four gigs of RAM, as of 1.4. These days, it's probably down to like three gigs-ish. But if you look at my blog, the Robust Perception blog, there's an article there on how much RAM Prometheus needs for ingestion, which goes into this. Yeah, also, you might want to look at the storage flags, right? There are several knobs you can configure to tweak the memory usage at runtime, a couple of which are sketched below. We know that's not ideal, so we're actually working on a new storage layer; I have the benchmark running right now. So hopefully we are going to cut down memory usage by, I don't know, 75% or something. If everything goes right, it will definitely go down significantly in the next months.

Could the next speaker get hooked in, because we've got like five minutes until the next talk? Whoever the next speaker is. I think we can fit in one more question while the next speakers get ready. So first, there was a request about putting up the slides right now, so people can look at them. We will hand them over to our back office, so to speak, and they'll put the link to the presentation into the description of the talk. So if you just want to look at the slides, you can go there. And while we're waiting, show of hands: Who is using Nagios? Who's using Icinga? Who's using Zabbix? Yay! Who's using Prometheus? Who's using Graphite? Who's using Grafana? Did you see that? Anything else which I should ask for? Logstash? Graylog? That didn't work out the way you wanted it. What? RTG? Well, MRTG, yeah. With a crank on it.
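A rough sketch of the 1.x storage knobs referred to in the memory answer above; flag names and defaults changed across 1.x releases, so treat these as illustrative and check the documentation for your version:

```
# Prometheus 1.x (sketch): cap how many sample chunks are held in memory.
# Roughly, fewer in-memory chunks means lower resident memory at the cost
# of persisting to disk more often.
prometheus -storage.local.memory-chunks=524288 \
           -storage.local.max-chunks-to-persist=262144
```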