Hi, we are extremely excited to be here at FOSDEM and to speak about a topic we are truly passionate about: observability, Prometheus, and a bit of Go. We hope this talk is really actionable and insightful for you, because by the end of it we would like you to know three things. First, why instrumenting your application with metrics is really, really essential. Second, how to do it quickly and make sure you have metrics for Prometheus to use. And last but not least, what common mistakes you should avoid — mistakes we've seen during our work with Prometheus metrics and Go in the amazing, but sometimes wild, open source world.

Before that, a short introduction. My name is Bartek Płotka. I am an engineer at Red Hat on the OpenShift monitoring team. I love open source and I love solving problems, usually using Go. I'm part of the Prometheus team and I'm also a co-author of the Thanos project, which is essentially a durable metric system that scales Prometheus. And with me there is Kemal.

Hello everyone, my name is Kemal. I also work on the OpenShift monitoring team. I love working with Prometheus, Go and Kubernetes, and I'm also a contributor to Thanos. So let's have fun.

Today we'll talk about building a load balancer in Go — kind of. For demo purposes, let's imagine we want to implement an application-level HTTP load balancer, and let's say we do it in Go, because why not? The programming language doesn't really matter here; we chose Go because most of the infrastructure work we do is in this language.

So let's say we implemented a transparent load balancer like this, from a high-level view. We have a couple of components in Go. First, we have a single HTTP server that implements the ServeHTTP method — the handler — via the really nice ReverseProxy structure available in the httputil package of the standard library. This reverse proxy allows us to inject a custom transport, which is a RoundTripper. So we implemented our own transport called LBTransport, which has a few components. It has a discoverer, which gives us the targets to proxy to, and based on those targets a round-robin picker picks one in a fair manner: replica one, then replica two, three, one, two, three. The load balancer then forwards the request to the chosen replica and proxies the response back to the user.

This looks great — this implementation should work, why not? So let's say we deploy a couple of replicas of this in front of some microservices and let it run for a longer time. As soon as it runs, I hit the LB endpoint, try it out, and it works. So we are done, right? Well, maybe not. It works for me, but does it work for other users? What about the times when I'm not checking — does it work then as well? Do we have any information about how many requests failed with a 502 status code, which means the load balancer could not proxy the request? We don't really have that information, right? And what about other questions? Is the round-robin picker actually working? Is it actually picking in a fair manner? Is replica number two actually getting its fair share of the requests? And what was the distribution like yesterday? How do you tell?
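To make the design concrete, here is a minimal sketch of what that wiring might look like — the names `LBTransport` and `roundRobin`, the no-op discoverer, and the hard-coded replica URLs are assumptions for illustration, not necessarily the exact demo code:

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// roundRobin picks targets in a fair, rotating order.
type roundRobin struct {
	next    uint64
	targets []*url.URL
}

func (r *roundRobin) Pick() *url.URL {
	n := atomic.AddUint64(&r.next, 1)
	return r.targets[(n-1)%uint64(len(r.targets))]
}

// LBTransport is the custom RoundTripper: it sends each request to the
// replica chosen by the picker and returns the response to the proxy.
type LBTransport struct {
	picker *roundRobin
	base   http.RoundTripper
}

func (t *LBTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	target := t.picker.Pick()
	req.URL.Scheme = target.Scheme
	req.URL.Host = target.Host
	return t.base.RoundTrip(req)
}

func main() {
	r1, _ := url.Parse("http://replica-1:8080")
	r2, _ := url.Parse("http://replica-2:8080")
	r3, _ := url.Parse("http://replica-3:8080")

	proxy := &httputil.ReverseProxy{
		// The target rewrite happens in the transport, so the Director is a no-op.
		Director: func(*http.Request) {},
		Transport: &LBTransport{
			picker: &roundRobin{targets: []*url.URL{r1, r2, r3}},
			base:   http.DefaultTransport,
		},
	}
	http.ListenAndServe(":8081", proxy)
}
```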
And, you know, there are other questions, like what about latency? The load balancer endpoint is very slow — is it my target being slow, or did the load balancer itself introduce some latency to the request? And finally, maybe you are having an incident and you want to know which version was running yesterday at 2 p.m. and what was rolled out back then. This insight is missing from our implementation, right? What's missing is essentially monitoring. We can have lots of those questions, and it would be nice to have answers for them even before we know the questions. That's why, in the Site Reliability Engineering book written by the Google folks, you will find that monitoring is the foundation of running a production system — something you need before implementing the system itself.

As you might know, the main monitoring signals are metrics, logs and tracing. But guess which signal will help us here and give us the quickest answer to those questions. Yes, the answer is metrics. Metrics most likely give us the answer, and in comparison to logs, traces and maybe profiling, they are cheaper, near real time, and definitely actionable — we can alert and act on them, either humans or machines. In practice, metrics should probably be the first item on your monitoring to-do list. And as Björn mentioned in the talk before us: instrument first, ask questions later.

So why Prometheus? Well, I might be biased, but Prometheus is for me the simplest and cheapest option for collecting, storing and querying metrics right now. It's part of the CNCF, and it fits small, simple applications but also bigger ones with the help of other projects like Cortex, Thanos and M3DB in the CNCF space.

So let's try to instrument our load balancer with metrics and answer this exact question: what is the error rate of the requests that users are seeing? To do so, we could introduce a counter. We get this information by incrementing the counter every time our server handles a request, recording the method that was used and the response status code that was returned by the server.

Let's introduce this metric in Go — I hope you are familiar with this, but we'll go through it pretty quickly. We're talking about Go here, but there are client libraries for many other programming languages. First we import the official client_golang library, and to add the metric we define a variable: we choose a name, a help string and labels. Labels are the dimensions of the metric, and each unique combination of label values will result in a completely new series — that's really important to keep in mind. The next step is to actually use this counter. We have a ServeHTTP wrapper that serves the request, records the status, and increments our counter using Add(1) — there's even an Inc method that adds one to the value. And something that's really easy to forget is that we need to register the metric, and we do that via the package-level MustRegister function.
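Put together, the simple version might look roughly like this — a sketch that uses the global registry exactly as described here (a pattern we revisit in the pitfalls later); the names `instrument` and `statusRecorder` are assumptions:

```go
package main

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// httpRequestsTotal counts requests by status code and method.
// Every unique code/method combination becomes its own series.
var httpRequestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests handled by the load balancer.",
	},
	[]string{"code", "method"},
)

func init() {
	// Easy to forget: the metric must be registered before it can be exposed.
	prometheus.MustRegister(httpRequestsTotal)
}

// statusRecorder captures the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps a handler and increments the counter after each request.
func instrument(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		httpRequestsTotal.WithLabelValues(strconv.Itoa(rec.status), r.Method).Inc()
	})
}

func main() {
	// Stand-in handler; in the demo this would wrap the reverse proxy.
	http.ListenAndServe(":8081", instrument(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})))
}
```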
Then, thanks to that, we can add another endpoint to our server: /metrics, which is the convention. This endpoint returns the metrics in the text exposition format, like this — it gives us the metrics that we registered in the registry. Once we add that to our load balancer, we can connect Prometheus to it. Prometheus is a single binary that you run in the same cluster; you point it at the load balancer's /metrics endpoint and it will periodically scrape those metrics, let's say every 15 seconds. This way you can, after some time, go to the Prometheus UI or Grafana and actually use those metrics. Here we can see the number of requests per minute by code and method: we have around 120 in total, some of them are errors, most of them are successes. So this is great — our load balancer has some insight that allows us to debug things and know what is happening inside.

This looks easy, right? A few steps, a few lines of code, and we have metrics. So everything is perfect. Well, not really. There are some edge cases. It looked easy, but in some cases during our work we've seen that it can cause problems. This is what we learned while reviewing and developing instrumentation code that is meant to run in production, at work but mainly in open source. So Kemal and I will go through a few lessons — more advanced issues and how to resolve them.

Pitfall number one: the global registry. There is a saying in the very good Peter Bourgon blog post on the theory of modern Go: magic is bad; global state is magic. And that's certainly true here, especially if you use the Prometheus client to instrument your application — and by application I also mean a library. You can imagine that your library will be used by your project but also by other users in open source, and this can cause problems; you will see why in a second. This is especially important because somehow, in the Prometheus ecosystem, this pattern of using the global registry leaked out as a good pattern. It is not — don't laugh, it's really wrong — and we essentially want to make it obsolete. We'll show you a better way, and what the consequences are if you keep using the global registry.

So let's take our example again. As you might see, there are two pieces of global state here. First, we have a global variable. Second, we have the global registry: MustRegister hides underneath it the default registry struct that holds global state for the package. Now, what's the issue? First of all, it's magic. It means that if another package — one you imported, or a dependency of a package you imported — registered a metric with exactly the same name that you want to register right now, everything will panic. And the problem is that it panics with the stack trace of the second registration, and you have no idea where the first registration happened at all. You need to dig through the code with grep or whatever, or just change your metric name. It's magic. I have hit this problem so many times, even with the standard Go runtime memory metrics — it's a very common thing. This is why you should not use the global registry.
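As a quick illustration of that first problem, something along these lines is enough to blow up at startup — a hedged sketch, assuming both registrations end up on the default registry:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func newRequestsCounter() prometheus.Counter {
	return prometheus.NewCounter(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total number of HTTP requests.",
	})
}

func main() {
	// Imagine these two registrations live in two unrelated packages that
	// happened to pick the same metric name and both use the global registry.
	prometheus.MustRegister(newRequestsCounter())

	// The second registration panics with a duplicate-registration error,
	// and the stack trace points here, not at the first registration that
	// actually caused the clash.
	prometheus.MustRegister(newRequestsCounter())
}
```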
The second problem is a lack of flexibility. Let's imagine we have not one endpoint but three — one, two, three — and we just increment the metric on each of them. Now you can see we don't really have much information here: if I want to know the error rate of the requests against endpoint number three, we can't tell, because everything is grouped together. Let's say I actually want that information. What should I do? Well, remove the global state.

So let's fix this. First, instead of our global variable, we create a struct that wraps the variable, and you can finally instantiate those variables as you want — you can have many of them. However, if you use our new metric this way, it is still registered in the global state. So let's also replace the default registry with a custom registry that you instantiate yourself and inject into our new server metrics. This way you are in control of your registry, because you just created it: you added some metrics, you can add more — like the default Go runtime ones — and you know explicitly what you added to it. You can test it, you can do whatever. It's super powerful.

So let's see if we can achieve our flexibility goal as well. This will panic, obviously, because we just tried to register a metric with exactly the same name three times. That's a problem. However, we can resolve it very easily with a function called wrap-with-labels, which allows you to inject certain static labels, or a prefix, into the metrics registered through it. And that's exactly what we need here, because we want to group our metrics per handler. Suddenly, thanks to that, we know the error rate of endpoint number three. So this is the pattern I would suggest instead of using the default registry, and it's what we already do, for example, in the Thanos project for every package. It's somewhat more code, but trust me, it's worth it.

Pitfall number two: no tests. This is something I'm really passionate about, because metrics — and other observability signals like tracing and logs — are never tested. Literally, who tests that a log line is actually logged at the proper moment with the proper message? No one does, because usually log lines are for humans; they just need to be readable, so maybe the exact message doesn't matter. However, I would argue that for metrics it's really important to test. You depend on the metric so much, your reliability depends on it, that you should actually test it. Let's see why.

Let's take our example of the load balancer with our metric. We are solid 10x developers, obviously, so we want to test our load balancer properly, and we created a unit test. What we did was essentially this: we have the LBTransport RoundTripper, we mocked different things like the discoverer and the round-robin picker, we mocked our targets to be unavailable — just a simple test case — and we mocked the responses and requests. And the test looks like this: we send three requests, and because nothing is available, we assert that all three responses are 502. Really easy, and the test passes. In code, you have a unit test, you start the load balancer, and you send three requests using the httptest package — a very useful standard library package as well.
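A minimal sketch of that pattern — struct, injected registry, and per-handler wrapping; the names `serverMetrics` and `newServerMetrics` and the handler label values are assumptions for illustration:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// serverMetrics wraps the metrics of one server instance: no globals,
// so you can create as many instances as you need.
type serverMetrics struct {
	requestsTotal *prometheus.CounterVec
}

func newServerMetrics(reg prometheus.Registerer) *serverMetrics {
	m := &serverMetrics{
		requestsTotal: prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "http_requests_total",
				Help: "Total HTTP requests handled.",
			},
			[]string{"code", "method"},
		),
	}
	reg.MustRegister(m.requestsTotal)
	return m
}

func main() {
	// Our own registry: we know exactly what is registered in it.
	reg := prometheus.NewRegistry()

	// One metrics instance per handler. WrapRegistererWith injects a static
	// "handler" label, so the same metric name can be registered several
	// times without a clash, and we can still tell the endpoints apart.
	m1 := newServerMetrics(prometheus.WrapRegistererWith(prometheus.Labels{"handler": "one"}, reg))
	m3 := newServerMetrics(prometheus.WrapRegistererWith(prometheus.Labels{"handler": "three"}, reg))

	m1.requestsTotal.WithLabelValues("200", "GET").Inc()
	m3.requestsTotal.WithLabelValues("502", "GET").Inc()

	// Expose only what we registered ourselves.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	http.ListenAndServe(":8081", nil)
}
```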
You record the responses and assert that they are 502, three times — easy. So now we're kind of safe, right? We test everything, we can add different test cases, we are fine. Well, not really, because imagine we made a bug while instrumenting the metric: we instrumented the code wrong and we always record 200. What happens? You send three requests, you assert on three 502 responses, and the test passes. But if you deploy that to production and you have errors, the metrics page will show you success — everything is fine. And it's super easy to make bugs like this; it's code like any other code. So this is pretty serious.

Why is it serious? Someone could say, okay, this is an analytics thing, it's a small bug, maybe it misleads some people, who cares? Well, no — look at this. This is an alert that is very popular nowadays; essentially it fires on the symptoms of users not being able to use your service. It relies on this exact metric, and if the metric reports all successes even though there were lots of errors, this alert will not fire. And you rely on this alert firing at night, for example, during incidents. So it's really bad if you don't test your metrics.

What we want to do is extend this test a little bit. First, because there is no traffic yet, no requests, we assert on zero cardinality: there should be no series produced by this metric. After the traffic, after our three requests, we still assert on the correctness of our application, but then we also assert on cardinality — hey, we should have exactly one series now — and we assert that this one series has the value three for the 502 code. Let's see how we can do that in Go. Again, we have our nice server metrics structure, we have our unit test, and we use the really handy Prometheus testutil package. Before we do anything, we assert that the metric has not been incremented — there are no series, no cardinality — and we do that with the CollectAndCount function. Then we do the requests, assert the correctness, and check that there is a cardinality of one at the end. And then we check that the value is three using ToFloat64 — a very explicit name.

Now, what is powerful here is that when I run this test with the bug I mentioned, it fails. That is exactly what we wanted to achieve. So this is really powerful: with a few lines of code on top of your existing unit tests, you can add tests for your observability. I really encourage you to do so, and we do this in the Thanos code if you want to look at examples. Now I'll let Kemal talk about the other pitfalls.

All right, pitfall number three is about a lack of consistency. There are some high-level methods that we can use with metrics to observe our applications: the four golden signals from the Google SRE book, the USE method, and the RED method by Tom Wilkie, if he's around. Let's take the RED method. These methods help us because they define some predefined signals so that we can debug and alert on our application, and with the help of these common methods we can also reuse prebuilt alerts, rules and dashboards.
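Roughly, the extended test could look like this — a sketch building on the `serverMetrics` struct from the previous snippet; `newLoadBalancer` is a hypothetical constructor standing in for the mocked load balancer whose targets are all unavailable:

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestLoadBalancerMetrics(t *testing.T) {
	reg := prometheus.NewRegistry()
	metrics := newServerMetrics(reg) // struct from the previous snippet.
	lb := newLoadBalancer(metrics)   // hypothetical constructor; all targets unavailable.

	// Before any traffic the metric should expose no series at all.
	if got := testutil.CollectAndCount(metrics.requestsTotal); got != 0 {
		t.Fatalf("expected 0 series before traffic, got %d", got)
	}

	srv := httptest.NewServer(lb)
	defer srv.Close()

	for i := 0; i < 3; i++ {
		resp, err := http.Get(srv.URL)
		if err != nil {
			t.Fatal(err)
		}
		if resp.StatusCode != http.StatusBadGateway {
			t.Fatalf("expected 502, got %d", resp.StatusCode)
		}
		resp.Body.Close()
	}

	// Exactly one series should exist now...
	if got := testutil.CollectAndCount(metrics.requestsTotal); got != 1 {
		t.Fatalf("expected 1 series after traffic, got %d", got)
	}
	// ...and it should have counted our three failed requests.
	if got := testutil.ToFloat64(metrics.requestsTotal.WithLabelValues("502", "GET")); got != 3 {
		t.Fatalf("expected counter value 3 for code=502, got %v", got)
	}
}
```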
For the RED method, R stands for request rate (requests per second), E for errors, and D for duration. For our demo application we again have the http_requests_total metric, and with it we can aggregate and track requests: we already have the code and method labels, so we can track requests by code and method. For the errors, we can write queries on the status code. But we don't get duration just by using a counter, so we need to introduce a histogram for that. By introducing a histogram, we can track durations and calculate our latency. So just by conforming to a couple of conventions, we end up being able to use really cool libraries like the monitoring mixins and whatnot.

Of course, another pitfall is naming, and it's a common one, because naming is one of the hardest problems in computer science. But it's supposed to be easy for metrics, because we have official documentation — it's really well written, and you just need to use it. To emphasize a couple of points from that documentation: you should suffix your metrics with base units. The base form is important so that you can convert between units predictably. For accumulating counters, you should use _total as a suffix — this will actually be mandatory with OpenMetrics, so maybe it's a good time to convert your metrics. And please use the _info suffix for your meta-information metrics.

Still on naming, there is another aspect to it: stability. Let's see an example. Again, the same http_requests_total metric: you define it and decide to use it in an alert. Then, for whatever reason, at some point you decide to change the name and put the protocol in it. Now you have actually broken your alert: it won't fire, but it also won't fail. So you can create these implicit errors in your system. Just be consistent and be careful with your names.

All right, pitfall number five: cardinality. When we talk about Prometheus and performance, it always comes down to cardinality. So what is cardinality in this case? In the Prometheus context, cardinality is the number of unique time series that you have in your system. And don't forget: every unique combination of label values under a metric name creates another time series. Labels are very powerful, but you should be very careful and considerate when you use them.

Let's see an example. We are using the same metric — we are very familiar with it at this point — and now we try to track requests per path, so we just add a path label. It looks good: we have some paths and we can get our numbers. However, if we look closely, we also see a lot of random stuff, because the internet is not really a safe place. If you just put something into a label without any preventive measures, you can end up in a really bad situation. If you want to track discrete events, please use a logging system instead of a metrics system. All right, another aspect of cardinality is histograms. We already learned a lot about histograms from Björn's talk, but what is actually the problem?
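To make the next part concrete, here is a hedged sketch of how the duration histogram for the RED method's D might be defined — the bucket layout uses the ExponentialBuckets helper and is only an assumption about which latencies we care about, and a real handler would record the actual status code rather than a fixed 200:

```go
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// requestDuration tracks latency. On the wire, each labelled series expands
// into one counter per bucket plus _sum and _count, so every extra label
// value multiplies the number of series.
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "http_request_duration_seconds",
		Help: "Duration of HTTP requests handled by the load balancer.",
		// 8 buckets starting at 50ms, doubling each step (~50ms .. ~6.4s).
		Buckets: prometheus.ExponentialBuckets(0.05, 2, 8),
	},
	[]string{"code", "method"},
)

// observeDuration records how long a request took, labelled like the counter.
func observeDuration(status int, method string, start time.Time) {
	requestDuration.WithLabelValues(strconv.Itoa(status), method).Observe(time.Since(start).Seconds())
}

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer observeDuration(http.StatusOK, r.Method, start)
	w.WriteHeader(http.StatusOK)
}

func main() {
	prometheus.MustRegister(requestDuration) // or register into your own registry, as shown earlier.
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8081", nil)
}
```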
By default, underneath, histograms are just a bunch of counters: one counter for each bucket, plus a _sum and a _count. So if you just use the default bucket values from the client_golang library, you start with 12 counters, and if you then put some labels on it, things can get out of control pretty quickly.

Again, let's see an example. For this one we will use an http_request_duration_seconds metric. We run our application and collect a couple of observations. It looks good, but we actually have a problem here: the buckets are cumulative, and our actual latency is a bit higher than where most of the bucket boundaries sit, so they are not granular enough and we may not be able to act on them. So let's add more buckets into the equation. Now we have a more granular picture and we can actually use it. It's fine, right? What could go wrong? Well, at a certain point you will probably get an alert telling you that your Prometheus has increased memory consumption. And when you check your metrics, you will see why: now whatever label value you put in there is multiplied by 12. So especially when you are using histograms, please be careful and considerate, and don't put random values in there.

Which brings me to my next point, our last pitfall: poorly chosen histogram buckets. Underneath, as Björn told us, we have buckets because we want to use less memory: by aggregating all those observations on the client side, we gain a lot in terms of memory, and this is what makes Prometheus histograms so powerful. As a result, you need to be very deliberate about accuracy versus performance. Again, let's see an example. We use the same metric, but we come up with an arbitrary set of buckets. When we observe, they are not balanced and they don't give us any value. So let's fix this by using one of the convenience functions from the Prometheus client, and now we have a better distribution. Coming up with the correct bucket layout is an art form: you need to know your distribution very well, and you always need to keep in mind that there is a trade-off between accuracy and cardinality when you are choosing your bucket layouts.

So, in summary: whatever you do, observe your applications. Observability is essential, not optional. Determine your service level objectives, write your alerts, build your graphs, and use them. And since you now depend on your metrics — you rely on them, they are your reliability — test them as you test your business logic. And last but not least, avoid global state; make your life easier. All the code you've seen, this observed load balancer, is available as a working example, and there is more in there if you want to dig in and check further. I think that's it from us. Thank you for listening — also, we are hiring. Thank you.