Hello everyone. It's a pleasure to be here. I'm Filippo Giunchedi. I work as an operations engineer for the Wikimedia Foundation, and this talk will be about how we are deploying Prometheus and the road we took to get it fully deployed to production. So what I'm going to talk about today, if this thing works: a little bit of introduction to set the stage, and then what we have and what we need from a monitoring system in general. Why Prometheus? Why was it our choice, and why not any other open source time series database out there? Then the shape of Prometheus in production, how it works, basically. And finally, what Prometheus does and will do for us in production.

First, a little bit of introduction. The slide says Wikimedia Foundation, but Wikipedia, Wikimedia, what's the relationship? Why the confusion? Well, the Wikimedia Foundation is the non-profit organization that oversees and helps grow the projects of the Wikimedia movement, and among these projects is Wikipedia, certainly the most common and best known being the English Wikipedia. Wikipedia and its sister projects get 16 billion page views a month, along with 13,000 new editors. So it's quite a large project, and I'm sure everybody in this room has used Wikipedia at least once. Across all projects we have 41 million articles at the moment and 34 million multimedia files. Remember that all this content is freely available for everyone to consult, to change, and to share. You can find more data about what I just talked about in the report card. One of the core values of the foundation is sharing, giving every human being free access to knowledge, so we share our data and are as open as possible. All development, processes, and decisions happen in the open.

How does this thing actually work? We have at the moment four sites: two main data centers, both in the US, and two caching PoPs. What is a caching PoP? Who knows what a caching PoP is? Show of hands, please. Okay, some of you. You can think of it as a small data center that we place around the world to lower latency for users; at the moment there is one on the west coast of the US and one in Amsterdam. We have about 1,400 bare metal machines, which we operate ourselves, including the data centers. It's all fully operated in-house, mostly to keep our independence. At peak we're doing 120k to 125k requests a second, all HTTPS. Some time ago we moved all user traffic to HTTPS; that was quite an interesting project as well. And we're pushing 32 gigabits a second of peak outbound traffic to clients, which, as you can imagine, is mostly multimedia content.

This is what I just talked about. You don't have to memorize this picture, but you can see eqiad and codfw are the two main data centers, and ulsfo and esams are the two caching PoPs, with one in Asia coming online in the next two quarters.

So what did the monitoring landscape look like at the WMF? Well, it's more or less the standard open source monitoring stack: collecting metrics, storing them in a time series database, and visualizing them, with, say, Ganglia. I don't want to go through the whole list; some of these tools are very well known, some less so. Anyway, what has happened over time is that we kept adding monitoring tools but removed none, so this is all technical debt that kept accumulating.
So what we needed was something powerful enough for our needs to replace some of these tools and give us something more as well. Enter Prometheus. Why did we choose Prometheus? Well, one of the most important things Prometheus has given us is the power of querying data, of asking useful questions about all the data that we have. How many of you know what Prometheus is? Okay, a good chunk, excellent. So basically the idea is that once you get data into Prometheus, you can query it and slice and dice the data as you want. We'll see more about this later.

Prometheus is a toolkit, so the idea is that teams other than operations can use it to answer their own questions. Being the operations team, we have operational needs, say, bandwidth or request errors and so on, but other teams at the foundation might have different needs that also require time series data. Related to this is multi-tenancy, which is a separate point because from an operational perspective we are deploying multiple Prometheus servers, each independent of the others, for example one per team.

It's been very reliable, unlike some other monitoring systems that we've had; in production I don't think we've had a single crash. And of course that's very important, because the obvious but true observation is that a monitoring system should be more reliable than the system it's monitoring, otherwise it's basically useless. Resource usage is efficient: Prometheus has been very economical with CPU, bandwidth, and, more importantly, disk for storing time series. And the metric flow is easy to understand and debug: both the read path and the write path, inside and outside Prometheus, are HTTP-based. As you saw a few slides ago, we do quite a few HTTP requests a second, so it's a protocol we know well and encourage as many services as possible to use.

So what did we do before production? Due diligence. I didn't want to deploy this thing to production, turn it on, and tell the team, okay, this is the new monitoring system, use it. Wikimedia also operates a virtualized environment, Wikimedia Labs. You could call it an internal cloud, even though it's open to everybody. This cloud is used, for example, by the community to run the software they've written that operates on Wikipedia data. You might not know this, but Wikipedia is also edited automatically by bots, for example for counter-vandalism: if a spam edit is detected, it will be automatically reverted by a bot. These things all run in the virtualized environment. And of course it's also a playground for us before going to production, so we strive for this environment to be as close to production as possible. The only differences are that fewer resources are available, there's less traffic, and so on, but for all intents and purposes it should be equal to production.

So we used this playground to validate Prometheus, the point being that if it works in the virtualized environment, on spinning disks, with very limited resources, on a VM, it will run fine in production. And that's actually been the case. You can see what we did on publicly available interfaces: those are two instances of Prometheus that are available for everyone to see and look at. Finally, the Grafana instance for Labs is also available.
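[On the "slice and dice" point mentioned above: a minimal PromQL sketch of the kind of reshaping Prometheus allows. The metric and label names here (http_requests_total, site, cluster, code) are illustrative assumptions, not Wikimedia's actual naming scheme.]

```promql
# Per-second rate of HTTP requests over the last five minutes, per series:
rate(http_requests_total[5m])

# The same data re-aggregated on the fly: total rate per site and status code:
sum by (site, code) (rate(http_requests_total[5m]))

# Error ratio per cluster, computed from the same underlying metric:
  sum by (cluster) (rate(http_requests_total{code=~"5.."}[5m]))
/ sum by (cluster) (rate(http_requests_total[5m]))
```

[The point is that the aggregation dimension is chosen at query time, not baked into a metric name as it would be in Graphite.]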
So this is what it looks like. This is the standard Prometheus interface, and as you can see it's exposed on the web. Those are the targets, for example, for etcd, again used as a playground for production; we also run etcd in production.

How does it look for a single site? You remember the data centers I talked about? Well, there are one or more bare metal machines running Prometheus in production, and multiple Prometheus server instances per machine. Again, this has to do with the multi-tenancy point: multiple teams can have their own Prometheus server running on the same machine. For high availability, what we do is deploy identical machines and put them behind a load balancer with LVS direct routing. I won't go into the details, but this is the exact same stack that serves our production traffic, so it basically uses the same load balancer infrastructure as the main website. On each machine we also run nginx as a reverse proxy in front of Prometheus, for access control, authentication, and things like that.

Configuration happens through Puppet. It deploys a set of static configuration for Prometheus plus auto-generated YAML files. So service discovery happens by Puppet writing YAML files, which are then picked up by Prometheus. This has been working well for us, because most of our metadata is in Puppet; Puppet knows basically everything about the infrastructure. You can find the gory details in our Puppet repo, which, again, is public. We use Gerrit, but it's also mirrored on GitHub. It's all freely and publicly licensed for everyone to look at, both the good and the bad things. And that is our wiki for Wikimedia's technical documentation; that is the page for Prometheus. It has more details about what's going on, what will happen, and so on.

Then, do you remember the picture from the beginning? That was one site, but we have multiple sites. So how do we get a global picture of what's going on across the infrastructure? We use a feature of Prometheus called federation. The idea is that we run yet another instance of Prometheus pulling data from the site-local Prometheus servers. This way you get a global overview across all data centers of what's going on, say, how much bandwidth is being used, how much memory, and so on. And once you've identified a problem, you can drill down into the site-specific Prometheus instance. These also have different retention periods: the site-local instances might keep a month or two, depending on how much disk space you have, whereas the global instance scrapes only a select few metrics from all the sites and can therefore store data for longer, say a year or so.

This is how it looks in production, one of the Grafana dashboards. On the left, for example, those are the four sites, with all the network, load, and memory in use. And for each of those sites you have multiple clusters; I'm not sure if you can read it, but there is Elasticsearch, or the app servers, or memcached. So you have a breakdown of every cluster, how it's doing, and its current status. This is very useful; Ganglia gave us something like this, but not quite, and we'll see more about that.
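[To make the Puppet-driven service discovery and the federation setup concrete, here is a minimal configuration sketch. It is not our actual configuration: hostnames, file paths, and label names are made up for illustration, but file_sd_configs and the /federate endpoint are standard Prometheus features.]

```yaml
# prometheus.yml on a site-local server (excerpt): file-based service
# discovery, reading target files that Puppet writes out.
scrape_configs:
  - job_name: 'node'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/node_*.yaml'

# One of the Puppet-generated target files, e.g.
# /etc/prometheus/targets/node_appserver.yaml:
- targets:
    - 'mw1001.example.net:9100'
    - 'mw1002.example.net:9100'
  labels:
    cluster: 'appserver'

# prometheus.yml on the global instance (excerpt): federate a selected
# subset of series from each site-local server.
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'   # in practice, a narrower selector for aggregates
    static_configs:
      - targets:
          - 'prometheus.eqiad.example.net:9090'
          - 'prometheus.codfw.example.net:9090'
```

[Prometheus picks up changes to the file_sd target files automatically, which is why having Puppet rewrite them is enough for discovery.]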
One of the first use cases we had in production for Prometheus was actually database monitoring. We have 180 database machines running, specifically, MariaDB 10.0, with seven main clusters that hold the content for the wikis (not exactly, but almost) and 21 clusters in total. This was a very important use case for Prometheus in production, because the previous tool, which we still have in production, is called Tendril; it's developed in-house and not available publicly. What Prometheus has allowed us to do is open up the MySQL data, say, how many queries are being run on MySQL, for everybody to see. So we now have publicly available dashboards for MySQL data. And of course the way we did it is via stock, out-of-the-box components, namely mysqld_exporter and Grafana.

This is how it looks. You can actually look at this dashboard, but please don't hit it too hard. This is the query throughput for MySQL, all aggregated; as you can see, even on the databases there are quite a few queries going on. And what I was talking about earlier is that this lets us see both the aggregated view and drill down into a single cluster, into a single shard, in case there are problems; the most common one would be, for example, replication lag. At the top you can select the shard and the type of database you want to see. This is not something the previous tool could do, not as easily, and not in a way that was easy to visualize.

Last but not least among our use cases for Prometheus was replacing Ganglia, one of the tools we'd had for a long time and that has served us well. It's still there, but deprecated at the moment. What we used it for was inspecting service and cluster health; what I showed you on Grafana is roughly what we would have done with Ganglia. But the data in there was somewhat static: you couldn't manipulate the data Ganglia gave you, or you could, but it was very hard to do. So we mostly used it for machine-level metrics, CPU, network, all the information the kernel gives you, essentially, and also for some service-level metrics, say, Varnish and so on. It was easy to see individual machine metrics for Varnish, but not to look at them in an aggregated fashion and drill down into metrics and so on.

So what we had to do was audit all the custom Ganglia plugins we'd developed over time and see how we could replace them with Prometheus components. Without going into the details, that is our Phabricator instance, where we track issues; you can look at the tasks relevant to this work there.

For example, this is a screenshot I took the other day from Ganglia. It seems fine; that is the aggregated view of all the clusters, as I showed you before. But look at what the network graph shows. Well, not really: those are exabytes a second, according to Ganglia. I don't think so; we would be having a different conversation if that were true. So that is the kind of problem we were facing. The tool was okay on a basic level for everyday use, but it was sort of broken, or rather unmaintained, and so on. And of course, if you can't trust the data you're getting out of your monitoring, it is basically useless.
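[On the MySQL dashboards above: a hedged PromQL sketch of the kind of queries behind them. mysql_global_status_queries and mysql_slave_status_seconds_behind_master are standard mysqld_exporter metric names; the "shard" label is an assumption about how targets might be labeled, not necessarily how Wikimedia labels them.]

```promql
# Aggregate query throughput across all MariaDB instances:
sum(rate(mysql_global_status_queries[5m]))

# The same, broken down per shard, for the drill-down view:
sum by (shard) (rate(mysql_global_status_queries[5m]))

# Replication lag per replica, for the most common problem case:
mysql_slave_status_seconds_behind_master
```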
So what did we do to fully port all the Ganglia metrics to Prometheus? The usual case is that you take a Ganglia plugin that exports some metrics and replace it with a Prometheus exporter. All an exporter does, its whole job, is to convert the metrics it gets from a service into a format that Prometheus understands. There are usually two cases. The happy case is that the exporter is already packaged in Debian (we run Debian across the fleet, with some Ubuntu mixed in, but that's on its way out), so it's apt-get install, we have Puppet do it, minimal configuration, and that's it. The less happy case, or unhappy case, is that we have to write and package the exporter for Debian ourselves, and that is what we did for HHVM, for example. We used the Prometheus Python client to pull metrics from HHVM on each machine, took the JSON, mangled it into a format that Prometheus understands, and exported it via HTTP (see the sketch at the end of this section).

Usually for each exporter there is minimal configuration applied by Puppet, typically instructing the exporter where to pull the metrics from. For example, HHVM has a local interface that exports JSON, so we point the exporter at that, open a hole in the firewall, and that's it. Plus, tell Prometheus about this thing: there are such-and-such HHVM exporters, pull data from these machines. And build the Grafana dashboards, of course. As you can imagine, the most labor-intensive parts are building the Grafana dashboards, because they're hand-curated by humans, and writing and packaging the exporter, if needed, because that's code we then have to maintain, test, and so on.

I want to talk a little bit about the future of Prometheus for us. What we would like to do is onboard more teams. At the moment it's only operations using Prometheus, even though other teams have shown interest, because, again, the operations team's experience with it has been very good. We would like to instrument our services natively, because right now the majority of our services use StatsD, so we'd like to integrate native Prometheus support into them. We will be running a Kubernetes cluster, and Prometheus has native support for Kubernetes, so that will be one of the use cases for monitoring Kubernetes in production. Of course, we will be adding more and more exporters, more data into Prometheus. For example, we run quite a few JVMs in production, for better or worse, for example Cassandra and Elasticsearch, and there is a way to export data from running JVMs. And then, of course, alerting. That's a use case we haven't tackled yet; we have concentrated on the monitoring side of it, so we're not running any alerting on Prometheus yet, but we will soon.

Last but not least is retiring Graphite. At the moment we have a Graphite stack running on four machines which, as a matter of fact, on Friday ran into a problem: we burned through the SSDs, because Graphite is not very nice to its local storage, so the SSDs are basically worn out. That's one of the reasons, among others, why we would like to retire Graphite. How do we do that? It's a work in progress; it's not easy, most things are still in Graphite, so we'll see how we manage.

I have a few takeaways for you to take home, or on a plane, or wherever. As you could probably tell, Prometheus has been helping the Wikimedia Foundation's monitoring.
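[Back to the exporter-writing case from earlier: a minimal sketch of a JSON-to-Prometheus exporter in the spirit of the HHVM one, using the Prometheus Python client. The admin URL, JSON keys, and listen port are assumptions for illustration, not the real exporter's values.]

```python
import json
import time
import urllib.request

from prometheus_client import Gauge, start_http_server

# HHVM exposes a local admin endpoint that returns JSON stats; the exact
# URL and field names here are assumptions for this sketch.
ADMIN_URL = 'http://127.0.0.1:9002/check-health'

# Map a couple of JSON fields to Prometheus gauges.
hhvm_load = Gauge('hhvm_load', 'Active requests reported by HHVM')
hhvm_queued = Gauge('hhvm_queued', 'Requests queued inside HHVM')

def collect_once():
    # Fetch the JSON stats and copy the fields we care about into gauges.
    with urllib.request.urlopen(ADMIN_URL, timeout=5) as resp:
        stats = json.load(resp)
    hhvm_load.set(stats['load'])
    hhvm_queued.set(stats['queued'])

if __name__ == '__main__':
    start_http_server(9192)  # port is arbitrary for this sketch
    while True:
        collect_once()
        time.sleep(15)
```

[A production exporter would more likely use a custom collector so the JSON is fetched at scrape time rather than on a timer; the polling loop just keeps the sketch short.]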
And deploying it into production was fun. It was a rocky road as well, not a straight path, but it was fun nevertheless, and the gains are well worth it. We are now able to inspect our data and match it to what's really happening inside the infrastructure. And of course, multidimensional metrics are awesome, because earlier, for Varnish metrics for example, we had to construct very long Graphite queries, play with wildcards, and hope the naming scheme stayed constant over time, and so on. So that's all I had, if you have any questions.

One of the earlier slides mentioned Icinga too. Can you explain the relationship?

So the question was the relationship between Prometheus and Icinga. At the moment Icinga does all the checks for us, for example reaching out to the machines, but it doesn't know about metrics; it doesn't know about time series or anything like that. It does check Graphite, for example. But what we'd like to get from Prometheus alerting, maybe deprecating Icinga in the future, is to run the Alertmanager for Prometheus, which will yield more powerful alerting. There was a talk two talks ago about alerting on time series. Does that answer the question? Any other questions?

About the data that you get from Prometheus, have you already gained insights that you could not get the way you were working before?

Yes. So the question is, have we already gotten more insight into the data we are collecting with Prometheus? Yes. For example, you see the spikes there in MySQL queries. We didn't know about those before. The reason for them is that there is a job queue in MediaWiki that periodically runs jobs, and this is the effect of those jobs on the databases. Periodically, I think every five minutes, they go to the database and run queries. We didn't know about this earlier.

Thanks for the talk. Will you send any of the new exporters upstream, any of the new things you export?

Yes. So the question was, will we be sending the new things that we added upstream for the exporters? The answer is yes. For example, the HHVM exporter is not published yet on the Prometheus website, but we are effectively upstream for it, because nobody else wrote one; and maybe in the future HHVM will support Prometheus metrics natively, if upstream is up for it. We've also been talking to the mysqld_exporter upstream.

I have a question about the data you actually gather. Can you talk about how much it is per data center and how you store it on the local Prometheus hosts?

Yes. So the question was, can I talk about the data that we collect for each site, how much it is, how it's stored, and so on. We collect, for the main data centers, around 15,000 points a second, which is roughly the load that we have on Graphite. So, just as a reference point, Prometheus is collecting about as much data as Graphite at the moment. Again, 15,000 points a second, and those take about 1.3 bytes per point on disk. The machines that run Prometheus have a mixture of SSDs and spinning disks.

I have a related question. What is the most used resource? Is it CPU usage or memory? Disk, obviously?

Sorry, can you repeat the question?

What is the most intensively used resource? Is it CPU usage or memory usage?

It's a combination of both. Surprisingly enough, for us it was not disk I/O.
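[On the Alertmanager point from the Q&A above: a hedged sketch of what a first Prometheus alerting rule could look like, picking up the replication-lag example from earlier. This uses the current YAML rule format; the threshold, duration, and labels are made up for illustration, since at the time of the talk no alerting was running on Prometheus yet.]

```yaml
groups:
  - name: mariadb
    rules:
      - alert: MariaDBReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Replica {{ $labels.instance }} is lagging behind its master'
```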
In fact, Prometheus is very efficient in disk I/O, unlike, say, Graphite, because it buffers data in memory and then streams it to disk. So the main resource usage is CPU, depending on how many queries you're running, and memory, because at the moment, to run a query, it has to load the data into memory. So it is essentially bound by how much memory you have. Questions?

Is there a reason why you chose bare metal servers for Prometheus?

So the question is, why did we choose bare metal servers for Prometheus? In the trial that I showed you, it actually started on Ganeti VMs, and we moved to bare metal machines because they offer essentially more performance. Any more? Okay.

Thank you very much for the talk.

Thank you. This is the last.