Hello, everybody, and welcome. Today we are going to instrument an application and do a lot of things with Prometheus. My goal today is to convince you that metrics are a great way to improve software delivery, that adding instrumentation to your applications is relatively easy to do, and that Prometheus is definitely a great tool for it. A few words about myself: I work at Red Hat, mostly on Prometheus. I'm one of the maintainers of Alertmanager and a member of the Prometheus team.

Before we go directly to the topic, I wanted to do a quick show of hands. Who, before this presentation, had heard about Prometheus? OK, quite a lot of people. And who is already using it? OK, I would say half of the audience.

So what is Prometheus? At its core, Prometheus is a monitoring system. It works by collecting, storing, querying, and alerting on metrics. As for its origins, Prometheus was developed at the startup SoundCloud, starting in 2012, I think. It went public in 2015, four years ago, and since then it has become more and more popular. Today it's backed by several companies and many individuals who contribute to it.

If we compare Prometheus to other systems, more traditional monitoring systems like Nagios or Zabbix, which you may know, rely on external checks for the applications they want to monitor. That means they don't require particular cooperation from the systems being monitored. The paradigm shift with Prometheus is that we expect the applications we want to monitor to be observable from the outside. Instead of being black boxes, they need to expose their state, something that tells us how they behave. This process is what we call instrumentation. In this presentation I'm going to focus exclusively on metrics instrumentation, but logs, events, and traces are other forms of instrumentation that are also interesting to investigate.

Let's see, for those who are not familiar with Prometheus, how this all works. On this diagram, on the right, we've got the Prometheus server, which is in charge of collecting and storing metrics. On the left side, we've got the systems we want to monitor. Prometheus is what we call a pull-based system, in the sense that it doesn't expect an agent to push the metrics; it takes the initiative to retrieve them itself. It does so by performing HTTP or HTTPS requests against specific endpoints, what we call metrics endpoints, at regular intervals. Typically every minute it will scrape the metrics from the targets, the applications we want to monitor.

We have two options for our application to expose the metrics. The first one, at the top, is to use what we call a client library. These exist for many languages: Python, Java, C, C++, Ruby. Today, I would say, any programming language has a Prometheus library. The library offers an API where we create and register metrics, manipulate them (increment them or set them to a certain value), and finally expose them to Prometheus over HTTP. What Prometheus gets is what we have at the bottom of the slide: a simple text format, readable by humans, which I will describe right after.
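As an illustration, here is a sketch of what a scrape of such an endpoint might return; the metric name, labels, and values are illustrative rather than taken from an actual demo:

```
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 1027
http_requests_total{code="400",method="get"} 12
```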
So the first option, using a client library, is something that we highly recommend when it's possible. But in some cases there are limitations. For instance, we may have written the application years ago and we don't want to modify it, or we don't have the resources to do it. Or maybe the application is already instrumented: if you think about Java, you may have JMX in place, and you don't want to rewrite everything because that would be too painful. In that case we run what we call an exporter, a process that runs next to our application and is in charge of converting its internal metrics format into something Prometheus can understand. This obviously has an operational cost: you need to maintain the exporter side by side with your application.

Let's move on now to the data format that Prometheus is using. We can see here what we call a time series, which is a reference to something we want to measure. This time series has two parts. The left part is the name of the metric, which represents the entity we are measuring. In this case, the name says it's the total number of HTTP requests served by our application. The right side is what we call labels. You may also hear "dimensions" or "tags"; they are synonyms for the same thing. A label is just a key and a value, and the goal of labels is to allow us to filter or aggregate the time series as we wish. In the example I show, we have two different kinds of labels. We have code (the HTTP status code) and method (the HTTP method), which are labels added by our web application, so we can have multiple time series combining different values for those labels. And we've got the job and instance labels, which are dimensions added by Prometheus when it retrieves the metrics from the target. The job is a logical entity that represents our service, and the instance is the address of the application from which we get the metrics.

So this is what's exposed by the application. Prometheus, at a specific point in time, let's call it T, will scrape the target, get the text format, and convert it into the samples we have on the right of the table. Samples in the Prometheus time series database are 64-bit floating-point values, each associated with a timestamp and, obviously, with a time series. At the next scrape interval, which in this example happens one minute later, Prometheus will again get the metric values, store them in its internal database, and repeat that operation again and again until the application stops running. A small implementation detail on the time series database: I said that values are 64-bit floating points, but thanks to compression, a single sample is stored on average as one or two bytes on disk.

Now that we have the samples in Prometheus, we can query them. Prometheus has its own query language, different from SQL or any other query language; its short name is PromQL. The simplest query we can run is to ask Prometheus for all the time series of a given metric at a specific point in time. This is what we've got on the slide: if I ask for the metric name http_requests_total, I get these three samples returned by Prometheus.
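Here is a sketch of that simplest query and the kind of result it returns; the label values and sample values are illustrative:

```promql
# Instant vector: every time series for the metric, at query time.
http_requests_total

# Example result (values are illustrative):
#   http_requests_total{code="200", method="get", job="app", instance="localhost:8080"}  1027
#   http_requests_total{code="400", method="get", job="app", instance="localhost:8080"}    12
#   http_requests_total{code="500", method="get", job="app", instance="localhost:8080"}     3
```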
I can filter down using label selectors. I can say I want only the metric for code 400, and in that case I will get only one sample. These first two examples of queries are what we call instant vectors: sample values at a specific point in time. We can also do range vectors, which means asking Prometheus for all the samples in a given interval, a given range. In this case, I want all of http_requests_total with code equal to 400 over the last five minutes. By itself, that's not really interesting, but what we generally apply to this selector is a function such as rate, which computes the rate of change of this metric over the last five minutes. In this case the answer is approximately that value: it gives us the number of requests per second that the application has served for this particular set of labels.

We can do even more complicated things. We can do arithmetic; we can do comparisons. The last example I've put on the slide is something you would typically use for alerting. The query reads as: compute the sum of the error rate (an error being any 500-something HTTP status code) divided by the total rate of requests that our application is serving. If that percentage is above 5%, Prometheus is going to return some results to us. (These queries are spelled out in the sketch at the end of this section.)

So, Prometheus in a nutshell: we've got a powerful query language, PromQL. We've got, as we've seen, a pull-based model, combined with service discovery mechanisms, which is really useful because Prometheus can automatically detect when targets come up or down, but that's a completely different topic. We've seen that alerting, like everything in Prometheus, is based on time series; there's no difference between an alert and a standard query, I would say. And the last important point is that Prometheus is really simple to operate. It's written in Go and ships as a single binary: you just need to download the executable and run it, and it's ready to operate. There are no external dependencies on any storage or database; it just requires a local file system to store the data, and that's it. You can get it running on your laptop, in your tests, in your CI very, very easily.

That concludes the part for those who were not really familiar with Prometheus and what it does. Now, back to the title of the talk: why would you want to instrument your software? The first obvious reason is that we ship software for people to consume, for customers to use. So we want to be aware of whether what we ship meets the quality of service that our users expect. For that, metrics are really good indicators, because if our applications expose information about how many requests they serve, how many requests fail, and how long it takes to serve a request, we get an idea of what our customers are experiencing. That, I would say, is the main motivation for instrumenting your code, your services, your systems.

The second reason, which is a bit different, is that metrics are also really good for troubleshooting and debugging purposes. You obviously have integration tests, benchmarks, and staging environments, and I'm not saying you should replace them. But we all know about issues that only manifest when we run our systems in production, and having metrics in place can accelerate the time it takes to identify the cause of issues.
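For reference, here are the queries from this section written out as PromQL; the exact metric and label names are assumptions based on the slides:

```promql
# Only the series with code="400":
http_requests_total{code="400"}

# Range vector: all samples over the last five minutes:
http_requests_total{code="400"}[5m]

# Per-second rate of change over the last five minutes:
rate(http_requests_total{code="400"}[5m])

# Typical alerting expression: fire when more than 5% of requests fail.
# The regex "5.." matches any 500-level status code.
sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
  > 0.05
```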
Let's say your application is making requests to a database. If you have metrics reporting that some requests are failing, you can then ask your monitoring system whether your database requests are also failing. If not, you know you need to look for something else. Again, I'm not saying that metrics are a replacement for logs and traces, but they provide real-time insights that can greatly reduce the time it takes to identify problems.

The third reason I will mention for having instrumentation in place is capacity planning. If I can correlate how my application uses physical resources with how many requests it serves, then I can better understand whether I'm over-provisioned or under-provisioned. Again, the role of metrics is to provide us with concrete data about what our application is doing: not what we think it's doing, but what it's actually doing.

So you may ask: OK, you're right, we want to do instrumentation, but why choose Prometheus? The first thing to remember is that Prometheus is not only the server part, the thing that collects and stores metrics. It's also what we call the exposition format, the thing that the applications expose. Over the years, this format has been widely adopted across the industry, and many other monitoring solutions rely on it. There is even a project called OpenMetrics, hosted by the CNCF, which aims to turn the Prometheus exposition format into a neutral, established standard that can be used by different monitoring solutions, not just Prometheus. So adopting the Prometheus exposition format doesn't mean you have to use Prometheus.

The next reason why metrics are a good choice is that adding metrics to your system shouldn't degrade its performance. Typically, modifying a metric value is something in the order of a few nanoseconds. There is no storage requirement: everything is kept in memory, everything is aggregated. From a performance standpoint, adding metrics to your application is almost invisible.

Finally, you may have heard about Prometheus in the Kubernetes context; people often think they go side by side. In practice, Prometheus is not limited to Kubernetes. It's also very relevant for more traditional environments, like virtual machines or even bare metal. And the good thing is that if, or when, you move to more cloud-based environments, having Prometheus metrics already in place means there will be almost no effort to carry them over.

Now let's go into the code and see what it means to instrument an existing application. For this presentation I have set up something really simple: a web application, a Hello World. It's a Python application, instrumented with the Python client library. I've got a Prometheus server running next to it, scraping it, and I've got an Alertmanager in place that receives alerts from Prometheus. This is my application. It's really simple: it takes a name request parameter, so I can change that and the name changes. I'm also able to generate some errors: if I omit the name parameter, it returns a 500 error. The code without instrumentation is really simple, about ten lines, just one function that does all the processing; a sketch of it follows.
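A minimal sketch of what that uninstrumented app could look like, assuming Flask; the actual demo code may differ:

```python
# A minimal sketch of the uninstrumented app, assuming Flask;
# the actual demo code may differ.
from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello():
    name = request.args.get('name')
    # name is None when the parameter is missing; the attribute
    # access below then raises, and Flask returns a 500 error.
    return 'Hello, {}!'.format(name.capitalize())

if __name__ == '__main__':
    app.run(port=8080)
```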
So here I've got the version with instrumentation. What I'm going to show you for this Python app follows a workflow that is really similar whether you are using Java, Go, C, et cetera; you will find the same steps whenever you want to instrument something. (All of these steps are gathered in the sketch at the end of this section.)

The first thing we need to do is import some types, some classes, from the Prometheus client library. Here we import three metric types: counter, gauge, and histogram, which I'm going to describe. Then we need to initialize the metrics. The first metric I declare is a gauge. A gauge is a snapshot in time of a measurement, something that can go up or down. Like any metric, it has a name, here build_info. We also need to provide a description so that in the future we can understand what the metric is all about. I've defined a label for this metric, called version. On the line after that, I use the metric: since it has a label, I need to provide the label value, which here is the version of my application, and then I set the gauge to a certain value. This particular gauge is not going to move as my application runs; the idea is that it provides information about which version I'm running.

Next, we declare two counters. Unlike gauges, counters can only go up. They accumulate values, either events or durations, whatever you want to track. The first counter counts the total number of requests that our application serves, and the second counter counts only the requests that fail. They have no labels in this case. How do we use those counters? Really simple: look at the hello function. Every time we enter the function, we increment the hello counter, and every time we catch an exception, we increment the failure counter.

The last metric I show here is a histogram. Histograms are useful to track durations, latency typically. They are a more complex metric type compared to gauges and counters. The idea is that you define buckets of values and count how many events fall into each bucket. Here the buckets range from 0.001 seconds up to 1 second; for each bucket, I count the requests that take less than that amount of time to complete. To use it, I just wrap my hello processing with a timing helper so that the library automatically tracks the latency.

The last thing we need to do is wire up the metrics endpoint, and this is done here. Typically a client library has a utility function for that; we just import it, it does all the heavy lifting for us, and it exposes the metrics we can see here. So that's the metrics endpoint. We see some metrics relative to the runtime, in this case the Python runtime, such as garbage collection metrics. We also see some process metrics, things we get from the kernel. And we see the application metrics we've defined: here the build info, here the number of requests we've processed so far. If I switch to Prometheus now, we can see our application being monitored, and I can use the Prometheus UI to query my metrics.
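Here is a minimal sketch of the instrumented version, assuming Flask and the official prometheus_client library; the metric names, port, and version string are illustrative, not necessarily those used in the demo:

```python
# A minimal sketch, assuming Flask and the official prometheus_client
# library; metric names, port, and version string are illustrative.
from flask import Flask, request
from prometheus_client import Counter, Gauge, Histogram, make_wsgi_app
from werkzeug.middleware.dispatcher import DispatcherMiddleware

app = Flask(__name__)

# Gauge that never moves: it encodes which version is running.
BUILD_INFO = Gauge('hello_build_info', 'Build information', ['version'])
BUILD_INFO.labels(version='0.1.0').set(1)

# Counters only go up: total requests and failing requests.
REQUESTS = Counter('hello_requests_total', 'Total number of requests')
FAILURES = Counter('hello_failures_total', 'Total number of failed requests')

# Histogram with buckets from 1 ms to 1 s (durations in seconds).
LATENCY = Histogram('hello_request_duration_seconds',
                    'Request duration in seconds',
                    buckets=(0.001, 0.01, 0.1, 1.0))

@app.route('/')
def hello():
    REQUESTS.inc()                      # count every incoming request
    with LATENCY.time():                # observe how long the request takes
        try:
            name = request.args.get('name')
            # name is None when the parameter is missing, so the
            # attribute access raises and Flask returns a 500.
            return 'Hello, {}!'.format(name.capitalize())
        except Exception:
            FAILURES.inc()              # count the failure
            raise

# Wire the /metrics endpoint next to the application routes.
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {'/metrics': make_wsgi_app()})

if __name__ == '__main__':
    app.run(port=8080)
```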
Here I'm graphing the rate of HTTP requests that my application processes over time. I can do the same for the requests that are failing, and we see that we have some. Again, I can compute the percentage of requests that are failing, which is about 2% on average. I can also make use of my histogram metric and compute the latency of 95% of my requests, the 95th percentile. I've also defined alerts for my system, and we see one that is active: an alert that fires when Prometheus detects too many errors from my application. (The queries behind these graphs and an example alerting rule are sketched at the end of this section.)

There are a few things to think about when you start to instrument your systems, and a few methods that are useful to know. The first one is called the RED method, for Rate, Errors, Duration. It's relevant mostly for web services, RPC servers, and so on. The idea is that you measure how many requests your system serves, how many requests fail, and the latency of the requests, so that you can alert when too many requests are failing or when they take too long to complete. Another method is called the USE method, for Utilization, Saturation, Errors. It's more relevant for tracking physical resources, like memory or disk: it tells you when you are using almost all the memory or almost all the disk. Finally, if you run things closer to processing pipelines, what you want in addition to the rate and the errors is to track how fresh the data is: you can insert heartbeat events into your pipeline and measure how long it takes for the pipeline to process them.

A few guidelines and good practices regarding instrumentation of applications. First, obviously, use one of the client libraries that already exist; as we've seen, you get many metrics out of the box beyond your application-specific ones. Be consistent when naming your metrics, because that greatly reduces your cognitive load when you need to exploit the metrics later: if the name of a metric is predictable, everything is easier. If we track the duration of HTTP requests, we don't want a metric called duration, because it doesn't tell us much. http_request_duration is better, but still not good, because we don't know the unit of what we are tracking. http_request_duration_milliseconds is better again, but then you'll have people using seconds or microseconds, and it won't be consistent. What we recommend is to always use a unit suffix and International System units: in this case, always track durations in seconds, not microseconds or milliseconds.

Something to watch out for is overusing labels. Every label combination that your application exposes creates a new time series. A typical anti-pattern: your application has a user ID from the request, and you want to add it as a label. That's generally not a good idea. If you have hundreds of thousands of users, you will end up creating hundreds of thousands of time series for a single application. Multiply that by the number of instances running your application and you can easily end up with millions of time series, which Prometheus is not designed to deal with. When in doubt, refrain from defining too many labels. A good rule of thumb is on the order of at most 1,000 to 10,000 time series for a single application. And if you go down the instrumentation path, focus first on the things that matter for your customers.
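The dashboard queries from the demo could look like this in PromQL, assuming the metric names from the instrumented sketch above:

```promql
# Per-second request rate over the last five minutes:
rate(hello_requests_total[5m])

# Percentage of requests that are failing:
100 * rate(hello_failures_total[5m]) / rate(hello_requests_total[5m])

# 95th percentile of request latency, from the histogram buckets:
histogram_quantile(0.95, rate(hello_request_duration_seconds_bucket[5m]))
```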
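And the "too many errors" alert might be written like this in a Prometheus rule file; the expression, threshold, and names are assumptions based on the demo:

```yaml
# A sketch of an alerting rule file; the expression, threshold,
# and metric names are assumptions based on the demo.
groups:
  - name: hello-app
    rules:
      - alert: TooManyErrors
        # Fire when more than 5% of requests fail over 5 minutes.
        expr: |
          rate(hello_failures_total[5m])
            / rate(hello_requests_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "More than 5% of requests are failing"
```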
Then take the time to also instrument what happens when your application makes requests to upstream services: your caching system, your database, external APIs. That will also help you pinpoint problems, pinpoint issues. The last thing to remember is what, for lack of a better term, I call informational metrics. We've seen one with the build_info metric: something you can use to provide more context about what's running. Storing the version or the commit ID is always a good choice, because you can then correlate how the metrics evolve across different versions.

A quick word about alerting. We've seen that alerting rules are just regular PromQL queries, nothing fancy. Once you get to know PromQL, you will be able to create alerting rules very easily. Again, focus on what matters for your customers: alert on the symptoms rather than the causes. But don't forget to have metrics in place to identify the causes; just do it in that order: first focus on the symptoms, then explore the causes. It's also OK to have alerts that fire when something goes badly in your dependencies, but in general, treat those as a second level of urgency.

So we come to the conclusion. I hope I have convinced you that, as developers, we should take care of metrics and instrumentation from the start of the development cycle, not as an afterthought. We've seen that metrics are cheap to implement, that they don't degrade performance, and that they can definitely improve our software delivery process. And the last point: even if you end up not using Prometheus itself, if you are looking for something open source, have a look at Prometheus; it's definitely the best open source solution that exists today. Thank you. And on that, I have finished. I don't think we have time for questions, but I'll be at the Red Hat booth this afternoon from 2 to 4 if you have more questions, or obviously right after the presentation. Thank you.