Can you hear me well? Is it clear? So today we will talk about monitoring in general, and in the context of microservices in particular. We work at TransferWise. Our mission is to make money transfers across borders as convenient as possible: fast, fair, transparent, and eventually free for customers. The customer is the main focus of our business; we try to make sure they have the best experience while they transfer money, and that everything is transparent and fair for them. My name is Dwai Muhammad, I'm a software engineer at TransferWise. And my name is Vladimir, I'm also an engineer at TransferWise. Okay, the agenda today: we will talk a little bit about what monitoring is in general, the domain we are talking about, the problem we are trying to solve, and then the three main pillars of monitoring: logging, tracing and metrics. Does anyone in the audience know what microservices are? There will be prizes, by the way. A lot of servers? Why not have just one? It could work either way, synchronous or asynchronous. But why do we split things up? Because you don't want just one server, so it's distributed. Anyone have any other idea? Okay, good enough. Why do people insist on using them? Let's give a quick description, because it's not the main objective of the talk. Back in the day there was one big application, called a monolith: one deployment that does everything your business needs. But if you need to change any feature in this big application, you have to redeploy the whole thing, and you might even break some other use cases. The way to solve this is to dissolve the monolith into small services, split by context.
Then every service does one job, but it does it, I would say, in a better way. And if you need to change anything in one service, it's a nice piece of work in isolation: you can redeploy it without affecting the other services. So from a deployment perspective you gain a lot with microservices, because you have this isolation. This is the chart of how it was before: one application, and now it's more of a coordination between multiple services. Now let's talk about monitoring: what monitoring is about and why it's so interesting. Monitoring is one of the main things you need to think about when you deploy an application, because it's the only way to understand how your application is working and how it behaves under certain actions or customer load. Without monitoring you cannot even tell whether your application is working or not. The way you approach monitoring can be classified into three pillars. Regardless of the approach, monitoring is all about this: something happened to your application and you want to know about it. This event happened; what are the characteristics, the environment, the context that identify it? It's all about events, but the way you react to an event differs. Let's take an example: say I provide a service that presents exchange rates for multiple currencies, from a source currency to a target currency. If I call this service, something happens to the system: a request comes in, and you need to react to it. The first approach is just logging.
Logging basically says: hey, at this time of day I received this request, and it took, let's say, 100 milliseconds. That's the logging approach. The second question is: why did it take 100 milliseconds? The request might go to the database, or ask some other service for the mid-market rate, and so on; there is coordination between multiple services before you get the result. The question of why it took 100 milliseconds is addressed by tracing. The third pillar asks: was this 100 milliseconds fast or slow? How fast were these requests over a period of time, say today? That's what metrics are for. So if you draw the diagram, you have metrics, tracing and logging. We'll start with the first, and I would say most basic and easiest one: logging. The first task in logging is to generate the logs, which means instrumentation. In the Java world you have SLF4J, which has multiple implementations; pick your poison. It's very simple in code: whenever you want to record some information, you just say log.info with whatever message and context you need. It can differ a bit in style, but that's not the interesting part. The main thing to think about first is the log levels. We have multiple levels: error, where you only log the errors; then warning, info, debug and trace. Depending on what kind and what volume of information you want, you can switch the level to see more or less.
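The level switch described above can be sketched in a few lines of plain Java. This is just the concept — SLF4J backends like Logback implement the real thing — and the level names follow the common convention:

```java
// A minimal sketch of how log levels act as a volume switch: a message is
// emitted only if its level is at or above the configured threshold.
public class LevelFilter {
    enum Level { ERROR, WARN, INFO, DEBUG, TRACE } // most to least severe

    static boolean shouldLog(Level configured, Level message) {
        // e.g. a configured level of INFO lets ERROR/WARN/INFO through
        // and drops DEBUG/TRACE
        return message.ordinal() <= configured.ordinal();
    }

    public static void main(String[] args) {
        System.out.println(shouldLog(Level.INFO, Level.WARN));  // true
        System.out.println(shouldLog(Level.INFO, Level.DEBUG)); // false
    }
}
```

Flipping the configured level to DEBUG is exactly the "switch" the speaker mentions: the same code suddenly emits much more volume.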
You also need to think about the pattern, how the log format will look. If you have one service, fine, any pattern works; but if you have multiple services, each owned by a different team, and everyone comes up with a different pattern, it will be a mess, especially when you want to analyze the logs later. You also need to make sure you don't expose personal information — usernames, passwords, things like that — in the logs. This is very important. Logs traditionally look like this: one pattern exposing the level, the timestamp, the thread, and so on. But nowadays things are moving toward JSON structured logs, which carry the same information in a more structured way. After you generate the logs, the second thing the application needs to decide is where to send them. One option is files, or standard output, or even syslog over the network. But if you have multiple services, each writing to different files, you need something to collect all this information — a collector or transport. There are multiple frameworks for this, like Fluentd or Logstash; they collect the information, do some transformation and filtering, and store it in a datastore like Elasticsearch. On top of Elasticsearch you attach a UI, for instance Kibana — hence the name ELK, the stack to handle logs. Any questions? Or we can save them for the end. The second pillar is tracing. As I said, when you receive a request and it takes, say, 100 milliseconds, you need to understand why it took 100 milliseconds.
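To make the structured-logging idea concrete, here is a minimal sketch of emitting one JSON log line by hand. In practice you would let SLF4J plus a JSON-capable backend do this; the field names are illustrative:

```java
import java.time.Instant;

// A minimal sketch of a JSON-structured log line: the same fields a classic
// pattern layout exposes (timestamp, level, thread, message), but machine-
// parseable, which is what makes later analysis in an ELK stack easy.
public class JsonLog {
    static String line(Instant ts, String level, String thread, String message) {
        return String.format(
            "{\"timestamp\":\"%s\",\"level\":\"%s\",\"thread\":\"%s\",\"message\":\"%s\"}",
            ts, level, thread, message);
    }

    public static void main(String[] args) {
        System.out.println(line(Instant.parse("2019-01-01T12:00:00Z"),
                "INFO", "main", "received rate request, took 100 ms"));
    }
}
```

Note the sketch does no escaping; a real JSON encoder handles quotes and special characters in the message, which is one more reason to use a shared logging library rather than hand-rolled formats per team.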
Suppose this service calls multiple services in order to get the final result. Tracing allows you to understand the journey of your request within your system. Each hop creates a span, identified by an ID, which records, for instance, the duration of that hop. The collection of spans is what we call a trace. Say this span is 30 milliseconds, this one 20, this one 50 — the total is 100. It lets you understand where those 100 milliseconds were spent. It also gives you some understanding of the architecture, because you see how the request flows through your systems. Most current frameworks are based on Google's Dapper paper: we have Jaeger, originally implemented by Uber; Zipkin, by Twitter; and Spring Cloud Sleuth, which is mostly based on Zipkin as well. I'll talk a little about Jaeger's components in general — this diagram is from their site. In your application you need to instrument, the same concept as in logging: you generate some information, some metadata. You use a Jaeger client, which depends on your implementation language — Java, Go, whatever. This client sends all the tracing information to a network daemon called the Jaeger agent, which forwards it to the collectors. The collector does some transformation and indexing before storing everything in a datastore like Cassandra. The traces and spans can then be viewed in a UI. One thing to think about with tracing is that you cannot trace every request, because the volume would be really high and it would be expensive.
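The span/trace model described above can be sketched in a few lines of plain Java. This shows the concept only, not the Jaeger client API; the span names are made up:

```java
import java.util.List;

// A minimal sketch of tracing vocabulary: each hop records a span with an id
// and a duration, and a trace is the collection of spans for one request.
public class TraceSketch {
    record Span(String id, long durationMs) {}

    // Total time explained by the trace: the sum of its span durations.
    static long traceDurationMs(List<Span> trace) {
        return trace.stream().mapToLong(Span::durationMs).sum();
    }

    public static void main(String[] args) {
        List<Span> trace = List.of(new Span("db-query", 30),
                                   new Span("rate-lookup", 20),
                                   new Span("render-response", 50));
        System.out.println(traceDurationMs(trace)); // 30 + 20 + 50 = 100
    }
}
```

Real tracers add a shared trace id, parent-span ids and timestamps so the spans can be stitched back together across services, but the "where did my 100 milliseconds go" question reduces to exactly this breakdown.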
You would basically have to deploy a lot of collectors and infrastructure if you wanted to trace everything. So what you do instead is sample the requests, and from the sample you can understand how your system behaves. The third pillar we call metrics. Metrics are about the trend of the behavior. As I said, when a request takes 100 milliseconds, the question you could ask is: was that fast enough? Was it slow? How does it compare to other requests in the same period? That's the problem metrics solve. It's basically the same idea again: you do some instrumentation to emit metrics — we will see this later when we talk about Prometheus. In general you have services emitting metrics, which are collected by a server and stored in a time-series database, which attaches a timestamp, and some tags, to every value. You can build dashboards and charts on top of these metrics to understand them, and you can have alerting engines, which evaluate rules to see whether something is wrong. Besides those three main pillars, there's a basic question you always need to think about: the health check. You need to answer the question: is my application alive? Is it working? Because if it's not working, you will have no logging, no tracing, nothing — all of those are instrumented from within your application. That's what the health check is about. You have to answer two questions. First, is my application alive? You can implement this with an endpoint called /health that has no logic — just a simple REST endpoint.
The second question is: is my application ready to receive new requests? This is the readiness check. Normally you provide an endpoint, and inside it you check the dependencies your application is counting on. For instance, if your application needs a database, or Kafka, or whatever, you make sure all of these are ready before you say: hey, please send me requests. Another approach is to register your application in some service discovery, like Eureka. So far we have presented three types of monitoring — so when do you use what? Starting from the most basic: the health check can only answer yes or no. My application is alive, or not. Nothing else. The second thing, metrics, can give you a trend: is my application's performance decreasing or not? But it cannot give you the reason why the performance is decreasing. The way to justify or understand the problem is to go and check some traces, or, as a last resort, the logs, where you have all the information. This sets the foundation of the monitoring pillars before we start with the specific case, which is Prometheus. Any questions? So I have one question: has anyone heard of Prometheus before? It's been a while since I graduated, so this technology didn't even exist when I was in university. No? Anyone? Okay. I think we have plenty of time. So, as Vlad pointed out — sorry about that — those are the steps that anyone working on microservices goes through to figure out a problem, or just to figure out the current state of the system. Health checks are obviously not expensive: they're very easy to implement, but they also don't bring a lot of information.
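The two health questions above can be reduced to this sketch: liveness is a trivial "yes", and readiness aggregates dependency probes. The dependency checks here are hypothetical stand-ins for real database or Kafka pings:

```java
import java.util.Map;
import java.util.function.BooleanSupplier;

// A minimal sketch of the two health questions: "am I alive?" (no logic at
// all) and "am I ready?" (every dependency I count on must answer).
public class HealthCheck {
    // /health: should contain no logic, just prove the process responds.
    static boolean alive() {
        return true;
    }

    // readiness: only say "send me requests" when all dependencies are up.
    static boolean ready(Map<String, BooleanSupplier> dependencies) {
        return dependencies.values().stream()
                .allMatch(BooleanSupplier::getAsBoolean);
    }

    public static void main(String[] args) {
        Map<String, BooleanSupplier> deps = Map.<String, BooleanSupplier>of(
                "database", () -> true,   // pretend the DB ping succeeded
                "kafka",    () -> false); // pretend Kafka is still starting
        System.out.println(alive());     // true
        System.out.println(ready(deps)); // false: do not route traffic yet
    }
}
```

In a real service each supplier would be wired to an actual probe, and an HTTP endpoint would translate the boolean into a 200 or 503 status.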
Your service could be indicating that it's alive, and that's about all you learn. Logging, on the other hand, is something that is only semi-automatable; it's not really appropriate when you have 300 or 400 services. That means a lot of instances you need to aggregate information from, and it's a problematic approach if you want an overview of the system. For that we have metrics, and we're going to talk a bit about them. Metrics, logging and traces are all very, very broad topics; each of them is probably enough material for at least one presentation. So in this presentation we decided to talk about Prometheus, and how we approach metrics and monitoring at TransferWise. I'm first going to start with a basic concept that is very relevant for monitoring: push versus pull. If you have, I don't know, 100 or 500 services, there are multiple ways to move the data around so you can aggregate it later. One way is to have a centralized server and make all the services push data to it. That is, obviously, the push approach. The other approach is pull: you still have a central server, but instead of the services sending information, you rely on this server to gather information from all the microservices. Both approaches have their downsides and benefits. In my opinion the pull approach is slightly better, at least for our case, because it doesn't let your services perform a denial-of-service attack on your own monitoring system. That would be problematic, because if your monitoring system is down, you're basically out of luck: you don't know what's happening, and you have to first investigate whether monitoring is up and working correctly before you can see whether your business is actually still functioning. Having said that: Prometheus is a pull-based, open source monitoring system.
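The pull model can be sketched end to end with the JDK's built-in HTTP server: the service exposes a metrics endpoint, and the "monitoring server" side is just an HTTP GET per scrape. The path and the metric line are illustrative, not Prometheus itself:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;

// A minimal sketch of pull-based monitoring: the service passively exposes
// its numbers, and the collector decides when to come and read them.
public class PullSketch {
    static String scrapeOnce() throws IOException {
        // The "service" side: expose a metrics endpoint on an ephemeral port.
        HttpServer service = HttpServer.create(new InetSocketAddress(0), 0);
        service.createContext("/metrics", exchange -> {
            byte[] body = "http_requests_total 42\n".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        service.start();
        // The "monitoring server" side: one plain GET per scrape interval.
        try (InputStream in = new URL("http://localhost:"
                + service.getAddress().getPort() + "/metrics").openStream()) {
            return new String(in.readAllBytes());
        } finally {
            service.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.print(scrapeOnce());
    }
}
```

Notice how the service never initiates traffic toward the collector — that is what protects the monitoring system from being flooded by its own fleet.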
It was developed at SoundCloud, I think, and has been in production since 2012, so it's a fairly established product, I would say. Because it pulls information, it relies on the so-called scraping mechanism, which is basically gathering data from the servers. It's your obligation as a developer to expose a certain endpoint that is consumable by the Prometheus server. This is the so-called scrape format; I can show you in a minute what it looks like. It's not really a protocol — well, you can call it a protocol, but it's more like a key-value representation of data. I won't go into much detail about the features of the product, because there are quite a lot and you can go very deep, but here are the ones I like. First of all, the Prometheus server is a time-series database, and three things are stored for each sample: the metric name, which could be anything — a temperature, a request counter, something like this; a timestamp of when the metric was scraped from a certain service; and a set of key-value pairs which are called labels — you can think of them as tags, because they serve the same purpose. One thing that's very interesting, in my opinion, is that it doesn't rely on distributed storage, so it's really easy to operate. You can also run as many instances as you want, and if one of them dies it's not really a problem. And if you get duplicate data or duplicate alerts, that's also fine, because in this domain it's always better to have duplicate data than no data at all — which could be the case if you relied on distributed storage. It also supports a very flexible query language; we'll see a bit of it in a moment. So, this is the architecture. I'm not going to go into too much detail — I've mentioned a few of the characteristics already.
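As an aside, the key-value scrape representation mentioned a moment ago looks roughly like this — metric names, labels and numbers are made up for illustration:

```text
# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="GET",status="500"} 153
```

Each line is one sample: a name, an optional set of labels in braces, and the current value; the server attaches the timestamp when it scrapes.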
The server, along with the UI, is basically one running instance. It relies on local storage; it doesn't depend on remote input/output operations, which makes it very fast and reliable. There is also another component that lives outside the server, called the Alertmanager; we'll see how it works in the demo. Its sole purpose is to take any alerts coming to it and distribute them to different systems — different channels, basically, if you look at them. For example, this could be email, Slack messages, PagerDuty, anything. I mentioned that the main approach of the system is that it finds servers and reads information from them, as opposed to servers pushing data to it. But what happens if — does anyone have an idea? — what happens if there's a short-lived job which only lives for 5 seconds or so? What would happen if the server fails to read the data from this job before it dies or completes? Any ideas? I would say it's a workaround, but there is a way to resolve this problem. This service over here acts as another component of the whole architecture, and it's used when you want to push data instead of waiting for the server to scrape it from your service. I tend to say it's a workaround because, as I said, Prometheus is in general a pull-based approach, and this is a push gateway, so it's a bit contradictory. But don't think of it as real-time data, or as a system where you can stream data: you push data, and then Prometheus, once again, scrapes it from there. Think of it as temporary storage where your data sits before it gets scraped anyway. So, there are four metric types. The first and easiest one is the counter. Counters are very simple: they represent a single value which can only go up — it's monotonically increasing — and people mainly use them for events, such as the logging events I gave as an example.
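A counter of this kind can be sketched in a few lines — an illustration of the concept, not the Prometheus client library; the metric name is made up:

```java
import java.util.concurrent.atomic.AtomicLong;

// A minimal sketch of the counter metric type: a single value that only
// goes up, plus the text line a scrape of it could produce.
public class ErrorCounter {
    private final AtomicLong count = new AtomicLong();

    void increment() {
        count.incrementAndGet(); // the only operation: monotonic increase
    }

    String scrapeLine() {
        return "http_errors_total " + count.get();
    }

    public static void main(String[] args) {
        ErrorCounter errors = new ErrorCounter();
        for (int i = 0; i < 153; i++) errors.increment();
        System.out.println(errors.scrapeLine()); // http_errors_total 153
    }
}
```

The absolute number matters less than its slope: queries like `rate` and `increase` look at how fast the counter grows, which is also how Prometheus copes with resets after a restart.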
So, for example, if your server is spitting out errors, like in this example here, you can see that the server generated 153 errors. This could be useful information if you want to establish some alerting on it, and we'll see how that works in a minute. Counters could also be used for counting users, for counting the amount of data sent or received, and for a lot of other things. Now, there's a little problem with this approach: because the service isn't obligated to persist the value, these counters may reset. But Prometheus takes care of it, because it remembers the last count from the last time it scraped the server, so it's essentially a solved issue. The second metric type is the so-called gauge. Gauges are again a single value, but in this case they can go both up and down. The way people would normally use them is to track CPU usage, memory usage, or JVM classes loaded in memory — normally those don't go down, but it depends on the JVM and how it's implemented; you can basically unload classes if you want to. That's what gauges are used for. Then there are two more types which are slightly more complicated than the first two. The first one is the histogram. I suppose it's pretty self-explanatory, but the idea is that you want to track the distribution of events. For example: I have an endpoint and I want to figure out how many of the requests took one second, or 400 milliseconds, or 200 milliseconds — basically you want to know how the data is distributed. Counters are not really useful for that, and this is why there is the histogram. The way this works is that you create buckets, and every time you do a new measurement it gets stored in a bucket. But there is a catch, because the values that you see are cumulative: for example, if you have a request duration of 800 milliseconds, it will be counted in several of the buckets, not only in this one, as you see here, and this will
answer the question of how many requests took 800 milliseconds or less: an observation is counted not only in its own bucket but in every larger one as well, so with each new measurement the corresponding counts increase in all buckets up to the one labelled infinity. The fourth type is called the summary, which is very similar. The difference is that you again sort of create buckets, in this case called quantiles, but this time they are — sorry, lost the word there — they are evaluated on the client side. That means you can gain more accuracy if you want to, but producing the metrics for your monitoring system will take some of your CPU power, and possibly memory as well. So you get accuracy, but you spend some of your resources, and you have to be careful with that. It can be used, for example, for things like garbage collection durations — in this case, Go garbage collection durations — which looks pretty much like a histogram; it's just that it's evaluated on the client side. Any questions so far? Not really? Okay, let's talk about something else then. The next tool, which usually goes hand in hand with Prometheus, is called Grafana. It's also an open source product, and its sole purpose is to visualize data. People use it a lot because it can integrate with a lot of data sources — obviously Prometheus is one of them, but also databases such as InfluxDB, or Elasticsearch. And honestly, I haven't seen anyone use anything other than Grafana, but I might be wrong, so don't quote me. Grafana provides basic alerting capabilities, which makes it very easy to track some metrics and get a notification if they are out of order. But its alerts are not as good as the ones provided by Prometheus. Why? Because if you have a very noisy graph you might get multiple notifications, which can be very tedious, especially if you're sleeping
at 3 o'clock in the morning, whereas Prometheus is able to deduplicate those events and will only wake you up once. I don't know if that's good or bad, but yeah. It also has a dashboard marketplace, because dashboards are not so easy to create; the marketplace makes it very easy for any developer to contribute dashboards and to use predefined ones. And of course the best part of Grafana is those beautiful dashboards that my manager loves so much. Okay, let's do a little demo. For our demo we have prepared a fairly simple use case, I would say. It's an application running on Kubernetes — anyone know what Kubernetes is? That's great. So it's running on Kubernetes, which is running on top of Google Cloud, and what we have is a very, very simple service. I will actually show you the code; it's so mind-blowing, it's only 30 lines of code. What it does is really very simple: you have one users controller, which returns a list of users — not even random ones, it's John Doe and Jay Doe, my friends; everyone has friends. And we also have an errors controller; this one will return a status 500 all the time. The reason we would use this is because we want to catch errors. So that's about the code. First I want to show you what Prometheus looks like. It has a very, very simple UI, but — as I said, or I don't know if I said it, but I'll say it anyway — the purpose of Prometheus is to gather metrics, not to show them; for that we have Grafana. In the Prometheus UI you can see your configuration, which could be something you might want to debug if you have any issues, but you also have the targets. In this case it's quite a few, but there could be a lot of targets: those are all the servers we're currently scraping for metrics. What else — we have an alerts section, and in our case we have defined only two alerts, and we will show how they work. The first one is basically written in the Prometheus query
language, PromQL. What we're saying is: for a metric named http_server_requests_seconds_count, where status equals 500, over a time range of one minute, check for an increase. This is a counter, and what the increase function does is indicate a positive value when the counter has grown: if it increased by one you get a one here, but if it increased by 500 you obviously get 500. We don't care how much it increased; what we care about is that there is an error of type 500 — a positive increase — for a period of 30 seconds, and then we get an alert. The second alert is slightly more Kubernetes-oriented: if you have a deployment, and this deployment has replicas, but the number of available replicas is less than one — meaning there are basically no instances running — we get an alert. So that's the Prometheus UI, and this is the Grafana UI. What we're exposing here is quite a lot of information, I would say. We have some quick facts about the uptime, when the service started; we have a lot of information about input/output operations; we have a lot of information on JVM memory and whatnot — memory pools, garbage collection cycles. Basically everything you want to know about your application, you can instrument and then display in Grafana. What we're going to do now is invoke the users controller, which, as I said, just returns a bunch of user names. I'm going to invoke this endpoint 100 times and see how this affects the rates. Now, as I mentioned, Prometheus is not a real-time system, so it takes a little bit of time until those metrics get to Grafana, but not to worry — it's this chart we're looking at. We should see a peak, because here we see that our little script is constantly calling the API, and you can see the rate here constantly increasing, which is our expectation. Let's see if I can refresh it faster,
nope, this is as fast as it can go. Now, if I stop it and we wait long enough, it will just go down — but I'm not going to wait for it; it's already going down. Okay, so let's do the other part of the exercise. Instead of calling the users endpoint, I'll just call the errors endpoint. It's not going to show a terrible screen; it's just going to tell you that something terrible happened. And what I'm going to do is start invoking this endpoint — not as fast as I can; I'll just put in a little sleep, like a real developer — and let's see what happens. Okay, what happens here is that we see a peak in errors. Now, this is definitely something that we might want to know about, because if you see a huge peak of errors coming at you, something is probably going wrong, and it might need some manual intervention. Usually there are multiple teams handling this — it very much depends on the organizational structure of the company — but usually there is a first line of defense, operational teams, who look at this and try to figure it out by themselves, and if they can't, they call the developer, and it's game over: you just have to switch on your laptop. Okay, but what I wanted to show you is that now that we have alerts, you can see this one is already yellow, with a status of pending, which means that the condition we have predefined here is now true, and now we have to wait 30 seconds until Prometheus decides: okay, I have waited long enough, this is an error, I need to send an alert. Okay, we waited long enough; now you can see that the status is already firing. And if you go to Alertmanager, which is another UI, you can actually see that there is an alert with this name. And if you switch to — well, not this one — if you switch to our monitoring demo channel, you will see that here we get an alert notification. Now, is this useful enough? Well, it probably depends on how much you want to stress people. I mentioned that there are other systems; Slack is not the best one if you have a very
catastrophic error. There are other systems — we use VictorOps for that — and there are also setups where you don't just read about the problem: you get a phone call with a message from an automated voice; it's not even that personal. So I'm going to stop this now — oh, it's already completed. Now, if I go back to Grafana, you will see that the errors are gone, and once again, if you wait long enough, you will see that this alert eventually becomes inactive. Why? Because this condition here turns false once again, and Prometheus will sort of auto-resolve it. Now, in order to make this shorter, what I'm going to do next is — oh, just let me show you. I told you that currently we are running only one instance, and there it is, you see it; it says Running, that's very cool. Now let's kill it. That's the one thing you don't want to see in production, but it happens. So what I'm going to do is — I'm not actually going to kill it; I'm just going to tell Kubernetes I want zero replicas, which is basically scaling it down. Now, if you say "give me all deployments", you will see that the service that was running has exactly zero available instances. And now you have a real issue, because if you cannot see your instance, obviously your users cannot see it either, so this is definitely something you need to know about. Similar to before, what we're waiting for is another alert. And while we wait — this one became green; I just wanted to show you that in Slack, if it was quick enough, it gets a green notification, meaning it was auto-resolved. It probably needs a little bit more time, so let's give it a chance and see if the other alert is already active. Any questions? Anyone? Yeah — it's not a real-time system, so let's wait. Any questions? That's a good question; now let me go a little bit back — yeah, it is a good question. I want to show you this diagram; I said I didn't want to spend much time on it, but here's how it works: basically, you feed the
Prometheus server with tons of configuration. If you have static IPs for all your services, you can put all those IPs in a single configuration, but that's not very scalable. I mean, especially if you're using Kubernetes, this doesn't work, because if you delete or redeploy your service, then your IP is definitely gone. So what you want, and what Prometheus supports, is service discovery. It doesn't have to be Kubernetes service discovery; other service discovery mechanisms — most of them — are supported. The way this works is that once you run Prometheus, it goes to the Kubernetes API and figures out all of the services — well, all the ones that you need. I mean, if you are running in an isolated environment, you may restrict your server so that it only looks for certain services, namespace-wise; but if you want a global configuration, it can scan all the namespaces, basically the entire cluster. It's up to how you want to manage it, but that's how it works. "Can you show them the configuration targeting the Kubernetes API?" Yeah, I'm going to get back to the targets. This is the configuration — it should be somewhere here. Well, this one takes a little bit of time to get used to, but basically Prometheus creates a bunch of jobs. This one, for example, is for cAdvisor, which is basically scanning the cgroups of the Linux servers underneath; those are for the service endpoints. So basically here you specify which targets you want; you can define the scrape interval and timeout, and you can define the metrics path as well, so you can configure multiple things. That is the easiest way to do it, with static configuration as Vlad mentioned, but the more scalable way is to do it using service discovery. Yeah — was there another question? "If I understand correctly, your request to Prometheus is an HTTP POST request?" You mean when you're sending metrics over, is it an HTTP request? It could be — it is
basically Prometheus who is getting queried. You just provide an endpoint, and actually, can you put it up here? What Prometheus... is it done? Oh, there. Now I need the alert... basically, we'll show you our services. I just wanted to show that this one service is already unavailable, so someone is getting alerted. So basically, your application will provide some endpoint, in our case it will be something like slash metrics, and you will have a text file. Prometheus will hit this endpoint, pull the whole thing with one request, and that's it. It's querying rather than updating. So let's scale it up, then we can show you. Yeah, let's see if it's up. Still not ready. Your application provides the endpoint, and Prometheus just does a REST GET call to this service. If that answers your question: it could be HTTP, but it could also be HTTPS, it very much depends on your configuration. From what I understand, most metric aggregation systems just send out UDP packets instead of actually sending over TCP, because they don't mind losing the data. So I was wondering why Prometheus doesn't tolerate losing the data. If I'm sending metrics to a certain server and I send just raw UDP instead of proper TCP packets, and it gets corrupted along the way or dropped somewhere, that doesn't really matter, because as you mentioned earlier, you just need a sample of whatever is going on. Why would you send a UDP packet? For a really voluminous stream it makes sense if you want to make it quicker, but in this case, at least the way Prometheus works, it basically scrapes every 30 seconds or one minute, depending on your configuration. It's not in your interest to lose this, and if you lose it, it might mean that you actually lost the service. It doesn't pull the data one by one. Tracing, for instance, uses UDP because the span information is sent for every span by itself; that's why it uses UDP packets, because if you have 1000 spans
you will have 1000 packets. But in Prometheus it's just one REST GET call and it will get all the metrics; it's really small, nothing really. That's why it's really light, I would say. Well, it doesn't really make much sense to send metrics one by one; you might lose a certain amount of them, but it's such a small amount of information that it doesn't really matter. I mean, sometimes Prometheus can fail to scrape in some interval. If you say you are scraping this data every 10 seconds and you miss one interval, it's fine, because in the next interval you will get the accumulated values anyway. It's not a big deal even if you couldn't manage to get it at the correct times. It's not real time; you can tweak your interval to get closer to real time, but it depends on your own needs. Actually, would it matter if you miss a certain scrape for a counter? Probably not, because your counters are always increasing; if you miss one scrape and get the next one, well, the counter is still there. You might miss one value, but you still know the increasing rate of this counter. So for counters it's not really a problem. For gauges it could be a problem if you miss a scrape; for example, if you're tracking the CPU usage and you miss a scrape, at that very time you might have missed a spike of CPU usage. But is that important? It very much depends on your system. This is why most of the things that Prometheus does are configurable. Normally people do 30 to 60 second scrape intervals, but you can get it as low as maybe 5 seconds or so. Of course, the problem with that is that it will put more pressure on your services, so you don't want to transfer this much information every time unless you need it. The general rule is: get rid of all the data that you don't need. Metrics that you don't use, get rid of them; if you don't need the amount of metrics that you're getting, you can get rid of them.
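The point about counters surviving a missed scrape can be sketched with a few lines of code. This is a rough illustration, not Prometheus internals; the `rate` function, the sample timestamps, and the values are all made up for the example:

```python
# Rough illustration (not Prometheus code) of why a missed scrape is
# harmless for counters. Each sample is (timestamp_seconds, counter_value),
# and the counter only ever goes up.
def rate(samples):
    """Average per-second increase between the first and last sample."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

# A counter scraped every 30 seconds, growing by 60 per interval.
full = [(0, 0), (30, 60), (60, 120), (90, 180)]

# The same counter, but the scrapes at t=30 and t=60 were lost.
sparse = [(0, 0), (90, 180)]

print(rate(full))    # 2.0 increases per second
print(rate(sparse))  # still 2.0: the counter kept the running total
```

A gauge has no such running total, which is why a missed scrape there really can hide a spike, as mentioned above.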
Because, as I said, if you have multiple services you can get a tremendous amount of data. So, do you need to run some instance of Prometheus for every service to be able to get all these metrics? Well, normally each service has to have an endpoint which returns something like this, let's say JVM GC memory. How are you going to get that without formatting it? It depends on the tooling. If it's a JVM service, a Java-based service, there's a lot of tooling which allows you to expose those kinds of metrics; if it's running on Go or Python, there are other tools to expose memory or whatever, since there's no JVM there. It's not handled by Prometheus itself. There's also something called an exporter, which basically exposes metrics about the host your application is running on. What you return is basically a metric name, probably some labels (but you don't have to), and some value, which is always a number, zero point whatever. That's all you have to return. Now, the metric name is something that you predefine; Prometheus doesn't care what the metric name is. There are some conventions that you are advised to follow, although you don't have to, but this is the name that you put. It can be a temperature, it can be, I don't know, some counter, it could be anything. Prometheus scrapes all those metrics, and you can search those metrics the same way you search in a database, and that's what Prometheus is at the end of the day: it's just a database with a lot of capabilities for scraping, for visualizing, things like that. Probably we can show how this looks in Grafana. Yeah, this is Grafana. These are the query functions, this is the name of the metric, and here are all kinds of labels. So if you want to add another metric, let's say logback_events_total, you will see that Grafana is actually kind enough to find those metrics for you. So it could be any metric; the way you define it, this is what matters. What you can do is you can say, OK, I want
the logback_events_total metric. Here you see all the labels; some of the labels are already instrumented by Micrometer, this library, but some are also instrumented by the apps themselves, so you don't have to add all the labels on your side. So what happens here is you get a lot of those metrics, and the difference is that they have different labels. This is actually a very strong point of Prometheus and the way it exposes data: if you have multiple services exposing the same metric name, the way to distinguish one from another is with labels, because one might be coming from one server, the other might be coming from another server, and a third one might be coming even from another data center, who knows. It's up to you to predefine those labels so that you can read the metrics later on. Prometheus doesn't care about this; it only cares that you provide it with a sufficient amount of information so that it can scrape those metrics. From this point on, it's up to your application to expose those metrics, and as I said, there's a lot of tooling to do this; after that it's up to you to create those beautiful graphs and charts in Grafana. Grafana also has alerts: you can build some basic alerting rules based on the queries, basically on the graph or chart you built, and you can configure it to send mail, Slack... Any other questions? No? Was it useful? Yeah, it was the best part. For this project in particular, the thing is that it's open source, so there's a lot of opportunity for people to read the code and see how it works, and also to contribute if they want to. And it's easy to deploy, even on your laptop, because the whole stack, like Grafana and Prometheus, can be run either using minikube for Kubernetes or even a local image that you can build. Well, no more questions. Just want to take this opportunity and say, if you're into that stuff
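As an illustration of the "text file" such a metrics endpoint returns, here is a sketch of the Prometheus text exposition format. The metric is the logback_events_total counter mentioned above; the HELP text, the application label, and all the values are illustrative, not taken from the demo:

```
# HELP logback_events_total Number of log events, partitioned by level
# TYPE logback_events_total counter
logback_events_total{application="demo-service",level="error"} 3.0
logback_events_total{application="demo-service",level="warn"} 12.0
logback_events_total{application="demo-service",level="info"} 847.0
```

Each line is just a metric name, optional labels in braces, and a numeric value; Prometheus pulls the whole document in one GET and timestamps the samples itself, which is why the labels are what lets you tell one server's samples from another's.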