Hello again, my name is Fabian, and within the next half hour I'm going to talk about monitoring legacy Java applications with Prometheus. So what is the best way to monitor Java applications with Prometheus? I'm sorry for my voice, I have a bit of a cold, I hope I can make it through the half hour.

The best thing you can do is to take the Prometheus client library for Java, add it as a dependency to your Java project, and just add your metrics directly to your source code. This is always the best option: it's the most flexible one, the most powerful one, and the safest one, and whenever you can modify your source code, this is what you should do. However, sometimes that's not possible, maybe because you have to run an application where you don't own the source code, or, even if the application is developed within your company, it might be that your wish for Prometheus metrics doesn't have enough priority to get your code changes into production. There are a few options for what you can do when that happens, and within the next half hour I'm going to talk about some of them. We start off with a log file monitoring example, then we have a look at the blackbox exporter, then I'm going to talk a little bit about JMX, the Java Management Extensions, and finally I'm going to give you a quick demo of how to write your own do-it-yourself Java agent for Prometheus monitoring.

So let's start. As an example application, I implemented this little thing here. It's a Spring Boot application with some little REST interfaces. If I click on this here, it hits the /hello/Alice endpoint and shows the result down here. I can also say hello to Bob. I have a small endpoint that just throws an exception in the Java code, which results in an HTTP 500 Internal Server Error, and I have one that just sleeps a second before it responds. We're going to monitor this application with different tools within the next half hour.

First up is log file monitoring. There are a few tools available. The most popular one is by Google; it's called mtail, and it's in the official Google repository, which obviously helps to make it popular. There are smaller independent tools as well, for example the grok_exporter, and I'm going to use that one in my demo; I'm also the author of grok_exporter. But I'm not going too deep into configuration details, so most of what I'm going to show applies in general to log file monitoring with Prometheus.

So let's have a quick look at the log file of the application. We see a stack trace here, which is of course because I hit the /throw endpoint, so that gave me a log line with an error, "something unexpected happened", followed by the stack trace. But most regular log lines look like this. If I click on Bob and Alice a few times, I can generate some log lines, and they all have the same format. The line wraps at this resolution, but I have a timestamp, thread information, the log level, which is INFO in most cases, the name of the Java class that logged the line, some information about the HTTP method and path, and the duration, how long it took to generate the response.
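To make that concrete, a line in the format just described might look something like this. This is a made-up example for illustration, not a line copied from the demo application:

```
2017-02-04 10:15:32.123  INFO 7428 --- [nio-8080-exec-3] d.c.demo.HelloController : request GET /hello/Alice took 7 ms
```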
So at this level, when you have semi-structured or unstructured log data like this, what basically all of these tools do is: you create regular expressions that match the log lines, and then you extract metrics out of the regular expression matches. Let's look at how this works with grok_exporter. This line down here is the most important one; this is the regular expression. The only specific thing about the grok approach is that if you use regular expressions a lot, they tend to become a bit unreadable, and what you can do with grok is define regular expression snippets: for example, you say this snippet should be made available under the name "path", and then instead of writing the expression out in your regular expression, you can refer to the path pattern that you defined above. Apart from that, it's just a plain regular expression. So you start off with anything, then comes the log level, then the name of the Java class, this is just a literal string here, then we match the path, then the duration, any number that could be the duration.

If we define this and define a counter metric for it, what we will see is just a counter counting the number of lines that match this regular expression. So let's have a look: I run grok_exporter -config with my first config file and go to localhost port 9144 /metrics. What we see down here, let me make it a bit bigger, is just this one metric. It tells us we have 17 lines matching this regular expression in our log file. That's not very useful, of course. What we need at least is some information about the number of errors, warnings, and info messages, and the way you typically do this in Prometheus is with labels. You say: I want to extend my metric here, I'm going to use labels, and I want to have the log level as a label; I also want to have the path as a label. In grok_exporter you can just do it like this: you say the part that matches this regular expression snippet should be the level, the part that matches this snippet should be the path, and then you refer to them down here: this is my level, and this is my path. These expressions here use a little templating language that's built into the Go programming language. It's a pretty simple templating language for configuration purposes, and it's also used in other programs written in Go.

If we start it now, what we're going to see is more useful information. Now we have those metrics over here: we see we have one error, we have seven info messages with path /hello/Alice, we have eight with path /hello/Bob, and so on. This already covers 80% of what you need in real-life log file monitoring with Prometheus, because now you can import this into Prometheus and trigger alerts whenever errors occur, or watch your warnings and trigger alerts when they reach a certain threshold.

I have one word of advice about using the path here. What actually happens in the Prometheus server is that for each line that you see here in this ASCII representation, a time series is created internally within the Prometheus server. So whenever you have a new combination of label values, like a new path, a new time series will be created within the Prometheus server.
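To pull the pieces together, a grok_exporter config along these lines might look roughly as follows. The metric name, the pattern names, and the match expression here are illustrative and written from memory to fit the made-up log line above; they are not the literal demo config, and the exact config syntax may differ between grok_exporter versions, so check the documentation:

```yaml
global:
  config_version: 2
input:
  type: file
  path: /var/log/demo/application.log
  readall: true
grok:
  # Reusable regular expression snippets, referenced below as %{NAME:field}.
  additional_patterns:
    - 'MYLEVEL (INFO|WARN|ERROR)'
    - 'MYMETHOD (GET|POST|PUT|DELETE)'
    - 'MYPATH /[a-zA-Z0-9/_-]*'
    - 'MYDUR [0-9]+'
metrics:
  - type: counter
    name: log_messages_total
    help: Number of log lines by log level and request path.
    match: '.* %{MYLEVEL:level} .* request %{MYMETHOD:method} %{MYPATH:path} took %{MYDUR:duration} ms'
    labels:
      level: '{{.level}}'
      path: '{{.path}}'
server:
  port: 9144
```

Each distinct combination of level and path that shows up in the log produces its own time series, which brings us to the cardinality question.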
Prometheus is pretty efficient at handling a lot of time series. If you have thousands of different time series, it shouldn't be a problem, but if you have millions of different time series, then at some point you hit the performance limits of your Prometheus server. What that means is: putting something like the path into a label is a very good idea if you have a limited number of paths, because it shows you which REST endpoint is called how often. But if you have something like a unique request ID as part of your path, then it's a bad idea to use it as a label, because with each new request ID you will generate a new time series, and eventually you will have so many time series that even the very efficient Prometheus server has problems dealing with them. So take care that labels have a limited set of values, on the order of at most a thousand different values or so.

Good. So this covers 80% or 90% of what you need in real life. I still want to talk a little bit about this duration value, because it's a good opportunity to highlight the difference between log file monitoring with Prometheus and log file monitoring with the Elastic Stack. The Elastic Stack is the most popular log file monitoring tool for Java applications; it's in use in a lot of projects. The way it works: you have Elasticsearch, I guess most of you know it, a clustered full-text search engine, and you have little programs called Beats. Those Beats take log lines and ship them to the Elasticsearch cluster, where the log content is indexed. Then you have a user interface called Kibana, and you can use Kibana for two things. One is full-text queries: you query for error messages and so on and find the corresponding log lines. The other is aggregation, which is similar to what Prometheus does. For example, if you index those duration values in Elasticsearch, you can run aggregations like "give me the median value observed within the last two hours".

So the question is: if you go with log file monitoring with Prometheus, do you still need your Elastic Stack? The answer is yes, you need it, because of the error messages. If you have a stack trace like this, let's scroll up a little bit, for example an error line like this, what developers want to do is search for it and find the error message. And if you do your logging a little more cleverly, you would maybe log a unique request ID with each log line associated with a request, and then you take the request ID, search for it in Kibana, and find all the log lines that explain how this error came about. This is something that Prometheus doesn't do. Prometheus has nothing to do with full-text search or log messages; it's just about numbers and time series. So you still need the Elastic Stack.

The other question is: what's the difference between monitoring those duration values with Prometheus and monitoring them with the aggregation capabilities of the Elastic Stack? The difference is that with the Elastic Stack, you ship each and every observed log line into the Elasticsearch cluster. So each individual duration value that you have here is available to Elasticsearch, and the aggregation happens on the fly when you run a query.
You say "give me the median of the last two hours", and then those values are aggregated. With Prometheus, the mindset is different: you never ship each and every observed value to the Prometheus server. Prometheus pulls every minute or half minute or so, however you configure it, and it expects aggregated data about what happened. So you don't ship each individual observed duration value; you aggregate the data and provide the aggregated result to the Prometheus server, and the Prometheus server only sees aggregated results.

Let's look at an example of how you would do this. You choose a metric type that's designed for aggregation; one of them is the summary. So I'm going to change my metric type from counter to summary, and then I just tell it which value should be aggregated. I say the number that I captured up here should be available under the name "duration", I highlight it so you can read it, and then I can use it down here: I just say duration. Now I have an aggregating metric, a summary, and I'm using the duration as the value to aggregate. Let's save it and start it again, and if we look at the metrics now, we see something like this. We still have our log level and our path, and we also have those quantiles here. Quantile 0.5 means that half of the requests with log level INFO and path /hello/Alice were faster than this value, and half of the requests were slower. Quantile 0.9 means 90% of the requests were faster than this value and 10% were slower, and so on. So this is aggregated data, and this is what's provided to the Prometheus server. Summaries are not the only metric type for aggregation; there are also histograms. Each of them has its pros and cons, and there's very good documentation on the differences between summaries and histograms and when to use which. But the principle is: you aggregate the data first and then provide aggregated data to the Prometheus server. I think I will take questions after the talk, because otherwise it will get a bit tight with the half hour.

Okay, so that's it for the log file monitoring example. Let's go on to something totally different and have a quick look at the blackbox exporter. This will be very quick because there's not much to say about the blackbox exporter, actually. I have a little script that prints the URL. The way it works is you start the blackbox exporter; in my case it starts up under this URL. Then, as a URL parameter, you provide the target that it's going to probe. This is my actual endpoint: this is the hostname, it's such a cryptic hostname because my demo is running in a Docker container, port 8080, and this is the API being probed, the one that sleeps 1000 milliseconds. As a second parameter I provide a module, and this module, checking for HTTP 200, is the most important one, because it just checks: does the endpoint reply with 200 OK or not? So let's try it. I just start the blackbox exporter; it doesn't really have any application-specific configuration. If I copy and paste my URL here, I see that the most important metric is this probe_success one. It just tells me the probe was successful, the server replied with 200 OK. It's a Boolean value: one if the probe succeeded.
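As a quick aside, to make the summary change concrete: in a grok_exporter config like the earlier sketch, the metric definition would change to something roughly like this. Again this is illustrative and from memory, not the exact demo config:

```yaml
metrics:
  - type: summary
    name: request_duration_milliseconds
    help: Request duration parsed from the log lines.
    match: '.* %{MYLEVEL:level} .* request %{MYMETHOD:method} %{MYPATH:path} took %{MYDUR:duration} ms'
    # The captured duration becomes the observed value of the summary.
    value: '{{.duration}}'
    labels:
      level: '{{.level}}'
      path: '{{.path}}'
```

Now, back to the blackbox probe and its probe_success value.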
If the probe is unsuccessful, it's zero. And there's some additional information, for example the request duration, where is it, here: that tells me it took a little more than a second for this endpoint to reply. This is the simplest thing you can do. The good thing is it always works: even if your application does not provide log lines, does not have JMX, nothing else, you can always still use blackbox monitoring. In mindset it's very similar to Nagios checks: you just run a check against the application, and if it responds, it's okay; if not, you can trigger an alert. However, it's not very powerful, because the application might easily return 200 OK at some endpoint while it still has problems internally, and you won't notice it that way. Good, that's it for blackbox monitoring.

Next up is JMX, the Java Management Extensions. I guess most of you have heard of JMX. It's something like a remote procedure call protocol, I would say, which is built into Java and available with every Java program, and it has been around for a very long time, maybe even since the beginning, I don't know, but very long. To show how it works, let me start a JMX client here; I don't know if the resolution will cope. This is Java VisualVM, which is a standard tool included in the Java Development Kit, so every Java developer has it on their laptop. Maybe they don't know it, but it's part of the JDK, so they have it. What you can do with it is connect to a program that exposes JMX beans and just have a look at the MBeans, and there is JMX information available about different things. First of all, the Java virtual machine itself provides some information, for example about garbage collection, heap usage, CPU usage, and so on. And typically with web applications you have a servlet container, in this case Tomcat, because Spring Boot applications use Tomcat by default, and it provides some information, like, for example, the GlobalRequestProcessor here. It tells me error count one, which means there was one error. If I go back to the application and hit the /throw endpoint, which just throws the exception internally, and, there's a refresh button down here which I cannot reach right now, ah, yeah, there it is, if I refresh, it goes up to two.

Good, so this is JMX, and there's a tool called the JMX exporter that you can use to turn this data into Prometheus metrics. Let's have a look at how it is configured. It's basically, the red is not very readable, it's basically a collection of very large regular expressions. The good news is that maybe you don't need to write them yourself, because there are predefined configurations for the things you typically use; for example, this configuration for grabbing JMX metrics from Tomcat is part of the JMX exporter distribution, so you may be able to just go with that. But if you look at what's actually happening: this is basically the part that I clicked in my JMX tool. There's Tomcat, GlobalRequestProcessor, this, and then the error count, and here we have something starting with Tomcat and ending with error count. If you analyze it, it's going down that path in the tree, and then there are some capture groups which are used here to extract metrics out of what it found via JMX. I don't need to start it right now, because you actually attach it to the application on startup.
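For reference, the Tomcat rule we're looking at is roughly of this shape. I'm quoting from memory, so treat it as a sketch rather than the shipped example file verbatim; in particular, whether the MBean domain is Tomcat or Catalina depends on how Tomcat is embedded:

```yaml
rules:
  # Walk down to the GlobalRequestProcessor MBean and turn attributes such as
  # errorCount into metrics; protocol and port are captured as labels.
  - pattern: 'Tomcat<type=GlobalRequestProcessor, name="(\w+-\w+)-(\d+)"><>(\w+):'
    name: tomcat_$3_total
    type: COUNTER
    help: Tomcat GlobalRequestProcessor $3
    labels:
      protocol: "$1"
      port: "$2"
```

And attaching it on startup looks something like this, where the port and file names are whatever you choose and the version suffix of the agent jar is omitted:

```
java -javaagent:jmx_prometheus_javaagent.jar=9123:tomcat.yml -jar demo-app.jar
```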
And I actually did this when I started my application, so the exporter is already up and running on localhost on the port I configured for it. And if you look here, error count, here it is. If you look at the error count, you see it's the same value, the value two. Those labels here, port 8080 and protocol http-nio, were parsed with capture groups out of this little thing here, http-nio-8080.

One word of advice about the JMX exporter. JMX is really more like a remote procedure call protocol; it's not a read-only thing. You can write values as well, which is not relevant for monitoring, but originally it was designed so that you can reconfigure applications on the fly. What that means is that whenever you call a JMX bean, there's no real guarantee what's going to happen, because whatever the implementer of that JMX bean had in mind can happen when you call it. When you come from a Prometheus background, you're used to the idea that when the Prometheus server pulls metrics from an exporter, that's a cheap operation: you just get the data, the exporter has it readily available, and you pull it. With polling JMX beans, that's not necessarily the case. There can be JMX beans that have their data readily available and just provide it, then it's cheap, but there can be JMX beans that, when you call them, trigger some complex operation or maybe even reconfigure something. It just depends on whoever implemented that bean, so there are no guarantees and you never know. And the thing about this regular expression approach is, of course, that you first need to load the JMX bean, then you match the regular expression, and if it doesn't match, you throw the result away. So you load more JMX beans than you actually need for your metrics in the end. If you do this with a poll interval of every 15 or 30 seconds and you have some expensive JMX beans in your server, you might pay a significant performance penalty for that. There is configuration in the JMX exporter for this, blacklists and whitelists, and if you experience problems, you can use them to fine-tune which JMX beans are actually accessed and which are not. So just have a look at how much load your JMX polling produces and use blacklists and whitelists to tune that, and then it's a good approach. The good thing is that out of the box, Java has some really interesting metrics, heap usage, number of classes loaded, CPU usage, so there's pretty useful information you can get out of it.

Okay, then let's go to the last point, which is actually my favorite at the moment, because it's the most powerful and also the most experimental: how to do your own Java agent for Prometheus monitoring. So what is a Java agent? Java, as you know, excuse my voice, is something like an interpreted language: it's bytecode, and the bytecode is interpreted on the fly in the Java virtual machine before it's executed. The Java virtual machine provides an interface, a hook, where you can write a Java agent. The Java agent hooks into the process, and whenever bytecode is loaded from a class file, the agent can go ahead and modify the bytecode before it's executed.
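To make the mechanism concrete, here is a bare-bones sketch of such an agent using the standard java.lang.instrument API. This is not the Promagent code shown in a minute, just the minimal hook the JVM offers, and the class name is made up:

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class DemoAgent {

    // The JVM calls premain() before the application's main() when the
    // application is started with -javaagent:demo-agent.jar
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Inspect or rewrite the bytecode of the class being loaded
                // (typically with a library like ASM or Byte Buddy) and return
                // the modified bytes. Returning null keeps the class unchanged.
                return null;
            }
        });
    }
}
```

The agent jar also needs a Premain-Class entry in its manifest pointing at this class; packaging details like that are the kind of thing the Maven plugin I'll show in a minute helps with.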
So the code that's actually executed is different from the code that was originally written in the application. One thing you can use this for is to write your own Java agent that inserts calls to the Prometheus client library into your application, even if the application didn't use the client library before. If you're interested in how to do this from scratch: there is a Java conference in Belgium called Devoxx, and last November I did a half-hour demo there on how to build such an agent from scratch. For now I'd like to show you something I implemented as an experiment. It's called Promagent, and there's a Maven plugin, the promagent-maven-plugin, that helps you do this very quickly and easily.

What I have here is a Java project with only one single class. I call it a hook, because it hooks into other classes. I annotated it with the classes or interfaces it is going to instrument, and in this example we're going to instrument the Servlet and the servlet Filter. Those are both standard Java interfaces that are always involved when you respond to HTTP traffic, so we can monitor HTTP traffic with that. Then there's a little bit of code just setting up the Prometheus metrics, which is the same as with the standard client library. Then I say which methods I'm going to instrument: the service method and the doFilter method. Before one of those methods is called, I just take a timestamp, and after one of those methods is called, I collect some metrics: the duration from my timestamp, the HTTP method, the path, the HTTP status like 200, and I maintain my metrics with that. This is the only code in the project.

The other thing is the plug-in, the promagent-maven-plugin; it's on promagent.io. If you configure it in your build and bind it to your packaging phase, then when you just build your project with mvn package, what comes out is a ready-to-run agent. Let's look at the target folder: what we have here now is this promagent.jar, and this is an agent that's ready to be attached to a Java application to instrument it with these metrics. So let's stop the original application and start it again with the agent attached. It's a specific command-line parameter, -javaagent, with the path to the jar file and so on. And once this is up and running, it should work as before. Let's see, yes, hello Alice; it still says hello Alice, hello Bob, and so on. Still working, but if I now go to port 9300, I think that's the one I used, I get some metrics. They look pretty similar to before: HTTP method GET, this is the path, status 200, and it was called four times, so I clicked Alice four times just now. Oh, I also see that jquery.min.js was loaded; that's part of my user interface, because I reloaded the application, and so on.

This is really interesting, because it's a powerful approach: you can basically instrument everything. If you want to instrument database queries, you just instrument the standard JDBC interfaces and you get information about that. If you want to instrument thread pools, you just need to know what your thread pool calls are, and then you can instrument that, and so on. However, it's really experimental, so if you use it right now, I guess you're one of the first users.
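To give an idea of what the metric bookkeeping inside such a hook looks like, here is a rough sketch using the standard Prometheus Java client library. The class, metric, and method names are mine, not the actual demo code, and I've left out the Promagent annotations that wire before() and after() to service() and doFilter():

```java
import io.prometheus.client.Summary;

public class ServletHook {

    // One summary, labeled by HTTP method, path, and status code.
    private static final Summary httpRequestDuration = Summary.build()
            .name("http_request_duration_seconds")
            .help("Duration of HTTP requests handled by the servlet.")
            .labelNames("method", "path", "status")
            .register();

    private long startTime;

    // Called by the agent before Servlet.service() / Filter.doFilter()
    public void before() {
        startTime = System.nanoTime();
    }

    // Called by the agent after the instrumented method returns; the agent
    // extracts method, path, and status from the servlet request and response.
    public void after(String method, String path, int status) {
        double seconds = (System.nanoTime() - startTime) / 1_000_000_000.0;
        httpRequestDuration.labels(method, path, String.valueOf(status)).observe(seconds);
    }
}
```

The observe() call is exactly the same code you would write if you could add the client library to the application directly; the agent just injects the call sites for you.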
So if you try it, feel free to create GitHub issues, provide some feedback, and tell me whether it's useful or not. I think the potential is pretty good, because with this approach you can basically monitor everything you want: even if you don't have JMX, even if you don't have log data, even if you want to monitor your proprietary business logic, all you need to know is what the Java classes and methods are called, write a hook, and then you have metrics for that. So let's see what comes out of it.

Let's wrap up. Within the last half hour we went through four ways of monitoring legacy Java applications. We started off with parsing log files, then we had a quick look at the blackbox exporter, then we had a look at the JMX exporter and JMX, and finally I showed you how to write your own do-it-yourself Java agent. As I said in the beginning, all of these are just workarounds, because the best thing you can actually do is to take the Prometheus client library for Java and put your metrics directly into your source code. That's what you should do whenever it's possible, and all of these are workarounds you only use when you cannot edit the source code of your application. So thanks for listening. I will stay around, so we can take questions after the talk. Have a nice FOSDEM. Thank you.