Hi, thanks again to HONUSB for hosting us. It's a great theater today, feels really cool. My name is Arsene Chernoff, I'm a practitioner, I've been in technology for more than a decade, and I currently work for Standard Chartered Bank, in the Cloud team there. I'm here to do the introduction to Prometheus. This is the start of a broader meetup series about modern infrastructure monitoring. Today we start with Prometheus as the hottest topic, and then I'll pass the word to my colleagues from Cloudflare, who will walk through what it takes to monitor hundreds of colocations and thousands of instances with Prometheus.

So, in a nutshell, it all starts with a couple of guys who created Prometheus. The story goes like this: there was this gentleman, his name is Matt Proud, he's on LinkedIn. He joined SoundCloud from Google after being a software engineer at Google for over five years, and he had an open source monitoring system in mind that he wanted to develop; that was his sole intention. At Google, on LinkedIn he was listed as part of a software development team, but in fact he worked really closely with the team that went on to form what we know as the SRE team, the Site Reliability Engineering team. Another colleague of his from Google joined SoundCloud afterwards; he had spent two years at Google, pretty much in the same site reliability area. They caught up, and he started contributing to this small open source side project. Then they merged their efforts and produced what became the Prometheus alpha and beta. It was not yet live on any public repos, but internally they started using it at SoundCloud. And there was another committer, let's put it that way, who joined in 2013, also an ex-Googler.

They had one big paradigm in mind: they wanted to make the principles of successful infrastructure monitoring adopted at Google available to the public. Sorry, close it up? Yeah, can you close it? Sure. Enjoy. No worries. I thought it was a question; I'm like, why would you ask questions at this stage? So yeah, these were the three guys at the start. There were actually more of them, ex-Googlers and really savvy engineers, who formed the beginning of Prometheus. Everything they wanted to do was deliver an open source project that would resemble what they had been doing in-house at Google. We'll talk about Google's approach to monitoring large-scale distributed infrastructure later on.

So, the major milestones. After the early releases used internally at SoundCloud, they rewrote the storage layer and converted the storage system to use the file system natively: chunking the time series and putting them into the file system instead of relying on a particular database as a back end. The first public release followed in January 2015. In May 2016, Prometheus joined the Cloud Native Computing Foundation, and it's now a very close neighbor to Kubernetes there in the CNCF. Prometheus 2.0 was announced literally a couple of weeks ago; it received a lot of updates that were long overdue on the roadmap, particularly around the storage of metrics and performance optimizations, and we'll talk about them; I hope the guys from Cloudflare will also chime in there. And today is our inaugural Modern Infrastructure Monitoring Meetup, and we're starting with Prometheus, so that's also a milestone.
The full story I'll share in every slide: there will be a link, so if you want to double-click and read the full story, just go for it. So the motivation behind it, as I've mentioned, was the set of best practices that Google's Site Reliability Engineers had been developing in-house for many years, and they brought it into the open source space with Prometheus.

First, what is SRE? There is this book that is available online, and of course you can also buy it if you want it in print or in Kindle format. It covers a lot of aspects, but the idea is that for successful planet-scale operations of an environment as complex as we know Google is, you have to have software engineers do operations. They do the same work as an operations team, but they automate instead of doing things manually. There is this notion of toil that they introduce: doing something manually once or maybe twice is fine if there's a pager and some outage, but beyond that they are expected to spend time developing automation to avoid that manual intervention. At any given time, over a month or over a quarter, Site Reliability Engineers are not allowed to spend more than 50% of their time on on-call duties or doing ops; the rest goes into becoming professional at automating what they've been doing manually and doing it at global infrastructure scale. And collecting indicators is apparently one of the very important aspects of the job: to understand what's not working well, what type of error it is, and to predict how the situation will develop. Prometheus is actually mentioned directly in this SRE book, and of course the gentlemen we talked about previously did a lot of work for that.

So Google has these internal notions, developed at Google, called SLIs, SLOs, and SLAs. By now we've probably gotten used to them already, but I decided to repeat them for us. The SLI is the metric: you define a particular performance indicator of a particular service in some way. Examples of an SLI would be request latency, an error rate, or system throughput. Then, once you have the SLIs, you can define an SLO, which is your objective. An objective is not what is usually referred to as an agreement; it's actually much less strict. You have an upper bound, you have a lower bound, and basically it says: this is where I expect my production environment to be to keep me happy. The SLA is the actual contractual agreement, so there's a guarantee, a way to pay, and a way to rebate if you're not meeting a particular objective. It's usually even looser than the SLO, because you don't want to get into trouble by, for example, overcommitting and underdelivering. One very simple example: you promise an SLA of 99.9 while your SLO for the same service is 99.95.

That's what is presented here. The SLI would be an indicator based on my HTTP status codes. My objective is to have less than 1% HTTP 500s over a rolling 30 days. And my agreement, the business agreement, the one I'll be penalized for if I breach it, is a 10% monthly refund for every additional 0.1% of those HTTP 500 errors. You might track it as latency, you might track it as throughput, but this is the idea: you set an SLO, you have a target, and then you start tracking it quarter by quarter. And that's the internal charge-back or the direct user agreement that you publish on the website.
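To make that arithmetic concrete, here is a minimal sketch in Python. The traffic figures are made up; only the below-1% objective and the 10%-per-extra-0.1% refund rule come from the example above.

```python
# Hypothetical traffic figures over a rolling 30-day window.
total_requests = 1_000_000
http_500s = 15_000                      # assumed count of HTTP 500 responses

sli = http_500s / total_requests        # the indicator: a 1.5% error ratio here
slo = 0.01                              # the objective: fewer than 1% HTTP 500s

if sli <= slo:
    refund_pct = 0.0
else:
    # the agreement: 10% monthly refund per additional 0.1 percentage points
    refund_pct = min(100.0, (sli - slo) / 0.001 * 10)

print(f"SLI={sli:.2%}  SLO={slo:.2%}  refund owed: {refund_pct:.0f}% of the monthly bill")
```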
So now there are also four golden signals that Site Reliability Engineers propose for monitoring any given system. They say that if these four are the only things you can monitor, you'll be fine.

First of all is latency, because latency is the essential performance characteristic of your web app: what it takes to service a request. And there is good advice to also track error latency, because slow errors are even more irritating for users. You really don't want to focus only on proper 200 responses with OK latency; you also want to see, track, and be aware of the failures and how long it takes for a user to actually see the error.

Second is traffic: how much the system is loaded at a given time from the network. It depends on the type of system. It could be, for example, a transactions-per-second metric if it's a database or some particular queue, or it could be throughput if it's streaming or a cache origin, for example.

Then errors generally: how much a particular service is failing in terms of, for example, wrong content delivered. This cannot always be measured simply; you need to have some probes, you probably need periodic checks, like what the response to this particular request would be, or what the behavior behind it is. So errors not just as web server errors, but generally: how are we dealing with a particular load and a particular set of parameters going into the system?

And saturation: the usual metrics like CPU. How much more load can I tolerate? In case my traffic is ramping up, how much more can this infrastructure stand before I need to stand up more services?

Then, once the SLOs are defined, there is the notion of an error budget, which at Google is quite peculiar. The idea behind it is that it's OK to have some failures. If you subtract the SLO from 100%, that is your error margin: room, so to say, for experiments, not to plan downtime, but to plan for some experiments to fail, or maybe to push the release train a little bit forward so that you can ship an earlier version of your service rather than delay it to a particular change window. But once the budget is depleted, you're no longer allowed to touch the infrastructure before the next counting cycle. So if you've used all your budget, and say you measure it week after week, so you only have that amount of outage allowed per week, you're no longer able to push any more releases until the counter resets for the next week or the next month. That way the DevOps teams, the SREs, and the actual product development teams are all aligned on the same set of targets. They all want to move forward with more releases, but at the same time they understand that the infrastructure has availability metrics to conform to. So if they are out of budget, they basically stop pushing and wait for the reset.
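A rough sketch of how that accounting could look, assuming the 99.95% objective from the earlier example and a weekly counting cycle; the outage figure is hypothetical:

```python
# A 99.95% availability objective counted over a one-week cycle (hypothetical window).
SLO = 0.9995
WINDOW_MINUTES = 7 * 24 * 60                 # minutes in the weekly counting cycle

error_budget = (1 - SLO) * WINDOW_MINUTES    # roughly 5 minutes of tolerated downtime

downtime_spent = 3.5                         # minutes of outage already used this week (assumed)
remaining = error_budget - downtime_spent

# While budget remains, releases and failed experiments keep flowing;
# once it is depleted, pushes stop until the counter resets for the next cycle.
print(f"budget={error_budget:.1f} min, remaining={remaining:.1f} min, "
      f"releases allowed: {remaining > 0}")
```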
So what Google does is have most of their applications expose their internal metrics. Think about containers, for example: anything that is deployed on Borg, the internal scheduler that Google uses, exposes the metrics of that container through a built-in web server. So anything that runs at Google, any application, any microservice, by default has its metrics exposed by something called varz: a web server that serves a particular path called /varz on that virtual host. And then you can pull it for all the metrics that are there, and depending on the server get the performance of a web server, a particular database, KDS, or anything else. So they expose all their metrics, and then they have a distributed set of systems that go and scrape those metrics. Architecturally, we'll see that Prometheus is precisely the same. It's all about the polling interval: you go to a particular target system, get the metrics, and then you store them in a time series database. At Google they end up in a TSDB, which is the blue unlabeled box on the left-hand side of the picture. In Prometheus there are multiple options, but it just ends up as chunked time series scattered across the file system of a regular instance.

So traditional monitoring kind of fails here. If you think about funneling all the metrics you have into one big collector and then graphing them, it's a lot of traffic. And not only are there many targets to monitor, these targets change dynamically: you have releases that you roll daily, or even hourly, or weekly. All these changes in your infrastructure become unique instances that you still have to track and be able to address. And it's not really possible to query the current state of your infrastructure without a complex dashboard or a predefined metric in the traditional monitoring solutions that are out there.

So what Prometheus offers is to collect the metrics the way we've discussed Google does. It exposes pretty much any service's metrics through something called an exporter, and there are at least fifty of them: you have them for different types of web servers, for different applications. And then you're able to use something called PromQL. PromQL is the query language for those metrics, so you can easily create, for example, a top-three view of a metric against some label, say total HTTP requests with status codes starting with 500, computed over a five-minute window. You define these queries, and you have client-side libraries, so you can literally run a Jupyter notebook, connect to a Prometheus server, and do the investigation and drill-down using a client-side library. Or there are visualization tools, and also alerting tools that come as part of the Prometheus server itself.

So on the left-hand side is the Prometheus server architecture. You see the exporters that are polled at a particular retrieval interval. The samples are written to a write-ahead log, so if the service crashes you can recover what was in flight before it got to the file system; you're able to recover, to some level of granularity, what happened in your environment even if your Prometheus instance is restarted. Then it gets chunked onto the SSD. Everything that is in Prometheus is queryable with PromQL, quite a developed language, and the alerts get pushed into Alertmanager. It's also possible to federate different Prometheus servers, and the guys from Cloudflare will hopefully share their really, really large Prometheus infrastructure. Really cool. And you can still query it with Grafana and do additional hooks into Prometheus using different clients. The storage architecture is built with one thought in mind: the monitoring system must be more reliable than the systems it's monitoring.
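Here is a minimal sketch of the exposition side using the official Python client library, prometheus_client. The metric names, the port, and the simulated traffic are my own illustrative choices rather than anything from the slides: the process serves its own metrics over a built-in HTTP endpoint, much like the /varz idea, so a Prometheus server can scrape it on its polling interval.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; a real service would pick its own.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

# Serve the metrics as plain text on http://localhost:8000/metrics,
# much like the built-in /varz endpoint described above.
start_http_server(8000)

while True:
    with LATENCY.time():                  # observe how long the simulated request took
        time.sleep(random.random() / 10)
    status = "500" if random.random() < 0.01 else "200"
    REQUESTS.labels(status=status).inc()  # count the request by status code
```

Once a Prometheus server is scraping that endpoint, the top-three view mentioned above would be something like topk(3, rate(http_requests_total{status=~"5.."}[5m])) in PromQL, i.e. the three worst series by rate of 5xx responses over the last five minutes.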
So that's why, and I think it's actually quite funny, I think it's an excuse really rather than a feature, but for now there is no special default way for Prometheus to store the metrics apart from chunking them onto the local file system. Yes, there are some threads about using LVM beneath it or using a third-party connector, and there are lots of, not connectors, adapters: adapters that allow you to write, or only read, or read and write into different types of backend systems for long-term storage. But for now it's all about the local file system, and it's quite fast. The needed disk space is literally the retention time multiplied by how many samples you're going to collect per second, at about two bytes per sample. There is a very interesting talk from PromCon earlier this year about how they figured out the compression, and it's really efficient: if you think about a timestamp plus the sample value of a metric, it should occupy two times eight bytes, so sixteen bytes, but they actually keep it in about two bytes. So they are quite good at that.
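As a back-of-the-envelope illustration of that sizing rule; all the scale figures here are hypothetical, only the roughly two bytes per sample comes from the talk:

```python
# disk needed ~= retention time * samples ingested per second * bytes per sample
retention_days = 15                  # assumed retention window
targets = 500                        # assumed number of scraped instances
series_per_target = 1_000            # assumed time series exposed by each target
scrape_interval_s = 15               # assumed scrape interval in seconds

samples_per_second = targets * series_per_target / scrape_interval_s
bytes_per_sample = 2                 # the compressed estimate from the talk (vs. 16 bytes raw)

needed_bytes = retention_days * 24 * 3600 * samples_per_second * bytes_per_sample
print(f"~{needed_bytes / 1e9:.0f} GB of local disk for {retention_days} days of retention")
```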
I just wanted to wrap up with what Prometheus is not. It's not 100% accurate. They basically position it as a tool that lets you operate at large scale and understand what's happening in your environment right now, with a very convenient way to query it, but it's not 100% accurate. If you really need accuracy, for example to cross-charge or to bill someone, then you need a different level of storage, a different level of reliability, and there are different solutions for that. It doesn't do any logging; it only collects metrics. It's also not anything like anomaly detection: it can give you alerts if some metrics go wrong, but there is no additional logic, and if you want that, you'll need to build it on top of Prometheus yourself. And it's definitely not a dashboarding solution, so you still need a standalone solution to chart, to plot, and to be able to leverage Prometheus that way.

The idea is to have one Prometheus server in each failure domain, so that if that domain is gone, by the definition of your availability you're basically accepting the fact that it's gone: you see through some external monitoring that the Prometheus server is gone, so for you the whole domain is gone. And within that domain, if everything works well, you keep the instances monitored by a designated Prometheus that sends the metrics up into a federated, multi-tier architecture.

Well, that's about it, that's just a brief intro before you guys see the really exciting things that the Cloudflare colleagues are bringing to the table. So, any questions? Yes, please. So, going back to the SLIs and SLOs and SLAs, which team in the Google world is responsible for coming up with all of these? It's teamwork, I guess, between the product team and the operations side. I'm not sure how change control generally works at Google, but I guess they do have some pre-production meetings before a particular application goes live. So at commissioning time, that's the right moment, from my assumption, to define how it would work and what it requires, because some applications need N+1 reliability in terms of the regions or slots of infrastructure they run on, while some applications need global low latency but can go down in a particular way, in a particular manner. So that is all negotiated before going live, and then the definition of the SLO is kept as the agreement in between; it's basically the tightest metric if you think about it, right? The definition of the SLO is like: we will not allow this service to go down more than this many minutes a month, or a week, and then within that budget we will try to experiment. One example would be: we actually agree that even if your service stays up more than its agreed SLO, we will send a chaos monkey and shoot it down, for example, so that you're not relying on, not exposing, a system availability that naturally ends up higher than its defined SLO. So they restart the servers, they restart the instances, just because some people may otherwise assume it's a higher SLO. Okay, well, time for the cool demo. Sure. Yeah.