Our next presenter is Hynek Schlawack. He's a developer at Variomedia. Thank you.

Hello, hello. So first of all, I would like to thank everyone who actually did what I asked and came to the front rows. Thank you so much for humoring me. I also have to apologize: I'm a bit physically inconvenienced, because I managed to get a blockage in my neck yesterday while doing my hair. So if you ever wonder what it's like to get old: it's not gray hair, it's not being able to do your morning routine without hurting yourself. If you're local and you know a physiotherapist who could unblock my neck, please come and talk to me after this talk.

Other than that: you may or may not know me. I work at a small web hosting company called Variomedia, and I want to teach you how numbers and colorful graphs can improve your life — and the life of those who are impacted if you get paged out of bed at 4 a.m. And my clicker is not attached, so — I'm not good at multitasking, one minute please.

So. Concretely, by the end of this talk I want you to be able to predict performance problems, because if you can prevent them, or if you can just stomp out a little blaze, it's so much better than having to fight a huge operations fire. If something happens anyway, I want you to be alerted by your system, with useful data — not by your hysterical boss on Slack, or by your very angry customers on the phone. And if a fire is burning, I don't want you to stare at a useless top output hoping for inspiration while your boss is poking their finger into your back. I want you to have data about your systems right at hand, in a meaningful format, and that also includes historic development — because once you run into a fire, you want to know how you got there.
Especially because this should be a feedback cycle: once something happens, you want to be alerted the next time, and ideally you want to prevent it.

If you want to reason about situations like "this is good, this is bad, this is really bad", you need an objective representation of these situations, and for that you can express the quality of the level of your service. You may have already seen those words, usually with a third word attached. First you need an indicator, which can be the request latency, or the uptime, something like that. Once you have an indicator — something to talk about — you can formulate service level objectives, which can be "latency must always be under 100 milliseconds", or "you have to have five nines of uptime", things like these. And then, finally, the probably most famous one: once you have objectives, you can formulate contracts and agreements on top of those numbers — because what will happen if those objectives are missed? Are we going to sue you? Are we going to cancel our contract? Agreements are not part of this talk, but SLIs and SLOs are, because SLIs are just metrics and SLOs are conditions you want to fulfill. In other words: you want to get alerted if you are not fulfilling them.

So, one step back: what are metrics? Metrics are numbers — samples — in a database. They are timestamped, which makes them a time series, and you're going to have a lot of time series, which means that you can correlate them: for example the request latency versus the server load, a very typical use case. You get those time series by adding instruments to your system, and a system can mean anything — it can be your app, or it can be a server. It's just like on a car or on a plane, except that these instruments now get hooked up to a time series database that will store them, and then, depending on what time series database you're using, it will allow you to do queries and operations on top of them.
You obviously have to instrument your app. Next up are probably some dependencies, like your database, your web server, your load balancer — they all carry very important and useful information for you to correlate with your application data. Then, of course, your environment: your server load, your memory, your I/O activity. And finally — and I think this is underappreciated — you can also instrument your business: the number of your customers, the number of your paying customers (we are not in San Francisco, so those numbers are more similar here, but still), or your daily revenue. Seeing a graph that correlates your frontend latency with your sign-up rates or your revenue can be enlightening sometimes, especially if you are arguing with your boss about why you need that SSD.

Now, nothing of this is new. People have been doing this for years; I've been doing this for years — actually, I talked about this last year, right here. But in the past you had to choose multiple components with various trade-offs, and most notably they weren't integrated. That's a bad situation to be in if you wake up one morning and say: okay, I want to have metrics, what do I do? Now you have to learn basically everything and then choose what you want to use. And some of them, like statsd, have some really, really bad properties, but you don't realize that until the fire is burning.

I find that Prometheus is different, and that's because it gives you a well-rounded and opinionated metrics and monitoring system which is integrated. It's absolutely flexible, but it has a proven and well-documented starting point. Opinions are a dime a dozen, obviously, but in this case it's okay to listen, because it's more or less a re-implementation of Google's internal monitoring system, implemented by ex-Googlers working — in this case — at SoundCloud. They were just missing their pet monitoring system. So, to give you an idea of how it works:
Let's have an architecture walk. The core feature, of course, is storage of time series. A time series is really just a named stream of float samples with timestamps. That's all. But Prometheus wants you to think in terms of four types that are built on top of these streams.

First, there are counters, which are for counting events — or counting anything — but the important property is that counters can only increase. They can increase by anything, though, so you can use them to measure your network traffic or to count your errors, whatever. If you need to set arbitrary numbers, a gauge is for you. A gauge is for exposing numbers, and it can be set to anything, so it's used for things like server load, temperatures, or the number of active requests right now.

These two are pretty obvious in how they map onto a timestamped float stream, but the others are more interesting. A summary takes measurements — it observes measurements — and allows you to compute the rate they come in, like requests per second, and the average measurement, like the average request time. Some clients (and Python is explicitly not one of them) also allow you to define percentiles which are then computed within the app. The reason it's not in there is that it's not really useful, because you cannot meaningfully aggregate percentiles — it's just not how math works. Instead, you should use histograms, which are like the workhorse of metrics in this case. They are also about observing values, and you keep track of averages, but additionally you define buckets — and these buckets should have the typical sizes of the values you are measuring. Then Prometheus can estimate percentiles server-side from these buckets, which also means that you are not deriving numbers in your application while it's serving some important requests. That's a very nice property.

Now, I've said "percentiles" twice, which is because they are very important, so I'll give you a quick rundown.
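Before the percentile rundown, the semantics of those four types can be sketched in a few lines. This is a toy, standard-library-only version just to show the rules each type follows — in a real application you would use the `Counter`, `Gauge`, and `Histogram` classes from the official `prometheus_client` package instead.

```python
class Counter:
    """Monotonic counter: may only go up, but by any amount."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Arbitrary number that can be set up or down at any time."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Observations sorted into cumulative buckets, plus a count and a sum.

    This is all a histogram stores -- percentiles are estimated later,
    server-side, from the bucket counts.
    """
    def __init__(self, buckets):
        self.upper_bounds = sorted(buckets)
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        self.count += 1
        self.sum += value
        # Cumulative buckets: a sample is counted in every bucket
        # whose upper bound it fits under.
        for i, le in enumerate(self.upper_bounds):
            if value <= le:
                self.bucket_counts[i] += 1
```

Note that `observe()` only does a few float additions per sample — that is why keeping histograms in the hot path of a request is cheap.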
Just so we're on the same page. It starts with the premise that averages are probably less useful than you might think, and to have something concrete to talk about, let's assume we measure request latencies. I think it's fair to say that request latencies are a good indicator of the quality of service: fast requests are good, slow requests are bad. It doesn't matter whether it's a web page or an API — in any case, you want it to be fast.

Now, the average time is not the average user experience. In this example, no user is experiencing a latency of 2.8. So not only is it not the correct answer, it's also muddling all numbers together, and you don't see that one request is really, really bad while the others are just fine. The problem here is that there are no bell curves in production — any production data you will encounter in your life is skewed in some way. And that basically means you may be wasting your time optimizing a perfectly good average case while there's just some outlier for some reason — and you will never find it if you don't know that it's an outlier.

So what is the average user experience here? It's one. And if you remember high school:
there is a function that would have told you so — it's the median, which takes a sorted data set and gives you the middle value, or the average of the two middle values if it's an even-sized set.

Now, the median's strength in representing the average user is, in this case, also its biggest weakness, because this still returns one — and I think we can all agree that this is not useful information to receive. Fortunately, where the median comes from, there's more, and this brings us back to percentiles. They also partition a sorted data set, but this time into 100 parts, and then you look at the nth value for the nth percentile. Or: the nth percentile P is the upper limit of n percent of the data set values. That sounds super confusing, but it just means the following: if the 50th percentile is one millisecond, then by one millisecond, 50% of your requests are done. That's all it means. And if you think about it, that's actually our median again — which by itself is again not useful, but we can go further: we now have a parameter that we can tweak.

So let's look at how long 95 percent of our fastest requests took — and we see we have a problem. Something is very, very wrong, and somewhere between 50 and 5 percent of our users are affected by it. At this point you can drill deeper, because, as I said before, Prometheus computes its percentiles server-side, so they are not fixed — you can always try to find others. With an average, you wouldn't have gotten anything useful either.
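The mean-versus-median-versus-percentile point can be reproduced with a tiny example. The latencies below are made up (nine fast requests and one pathological outlier); the `percentile` function uses the simple nearest-rank definition, which is one of several common ones.

```python
def mean(xs):
    return sum(xs) / len(xs)

def percentile(xs, p):
    """Nearest-rank percentile of a data set (one common definition)."""
    xs = sorted(xs)
    # Index of the value below which p percent of the samples fall.
    k = max(0, min(len(xs) - 1, round(p / 100 * len(xs)) - 1))
    return xs[k]

# Nine fast requests and one pathological outlier, in seconds:
latencies = [1, 1, 1, 1, 1, 1, 1, 1, 1, 20]
```

Here `mean(latencies)` is 2.9 — a latency no user actually experienced — while the 50th percentile (the median) is 1 and hides the outlier completely, and only the 95th percentile exposes the pathological request.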
You would just think that they all take forever.

Now, the problem with percentiles — and not a lot of people talk about this — is that they throw away most of the data, and that's a problem if you want a representation of your service health or your service quality. So in the end you still need the average, to have a number that distills everything and doesn't just look at certain values.

Now that we have the math out of the way, let's talk about naming. Whoever has used Graphite, possibly together with statsd, will have seen something like this: they put the metadata into the metric name, which is kind of annoying. So any modern TSDB — and Prometheus is one of them — switched to bare names. The best practice here is to prepend an app name (this is not a good app name, it's just a short app name so I can use a big font on my slides) and to append a unit: a "total" is a counter; if you are measuring times, you would have "seconds" or something like that. So it's a bit self-explanatory. The metadata is then added using so-called labels, which look like this. Each new label combination still adds a new time series — or, as they call it, a dimension. That means you do not get fewer time series, but it's much more readable, it's structured, so you can do aggregations in a much nicer way, like formulating queries on label values. It's really nice.

Now, how do you get those values? This is where it gets kind of interesting, because, contrary to most metric systems, Prometheus is pull-based — which means that each instrumented system
exports its metrics via HTTP, and Prometheus scrapes them for you. To use the metaphor from before: you add instruments to a system, and Prometheus looks at them regularly, writes them down with a timestamp, and is done.

This means a lot of things. First of all, you can adjust the resolution of each single target by configuring how often its metrics are scraped. If you want more frequent scrapes, you get more precise data, but it uses more disk space — it's always a trade-off. It also means that if scrapes fail for some reason — say, high load — you don't lose data or meaning, you just lose resolution. That's kind of important, because your average rates still make sense. Compare that to a push-based approach, where lost samples actually make it look like your rate is sinking — like things are going down — although it's actually rising beyond the capacity of your system to report metrics. This makes Prometheus really, really great for monitoring.

But it's a bad fit if you want things like accounting — that's a common question on the mailing list. You do not get the single values, like the individual request times; you just get averages out of it and can derive useful data from those, but it's not an accounting system. For that you'd have to go for something like Postgres or InfluxDB, if you need every single number.

Now, there are a few problems, too, of course. One is short-lived jobs, like your backup script.
You're not going to convert your cron scripts into web services just so someone can scrape metrics, and there's an official solution for that: it's called the Pushgateway. It receives the data from your short-lived script and retains it for Prometheus to scrape. Problem solved.

Then there's, of course, the problem of target discovery: if you want to scrape something, you have to know that it actually exists. Some people consider this a problem, but it actually just moves the problem of knowing what your production systems are from monitoring into your metric system — because Nagios also needs to know about all your systems. You're not getting around telling some system about your systems. And you can do it either by configuration — this will tell Prometheus to scrape itself, which gives you the number of time series, your buffer usage, and so on. An exporter, a target, or an instance all mean the same thing here, and a group of those together is called a job. So, for example, if you have multiple Prometheus servers, you could scrape them all in one job; or if you have multiple back ends of the same web service, they are one job but multiple instances. These two values — job and instance — you get automatically as labels on every scraped metric, so you can do filtering and aggregation on top of them.

In practice, of course, you're not going to do static configuration; you will use some kind of service discovery. We personally use Consul, and it works great, but people have been using it with other systems very successfully, too.

Now there's one final problem — and this one actually is a problem — and that is closed, NATed, or load-balanced systems, like Heroku, or end-user appliances that run in the local network of a customer. You cannot really expose things there, and if you do, people may get really mad at you. In the case of Heroku, there have been talks about an official plug-in, but as far as I know there's nothing concrete yet, and other than that there's no really good solution. Prometheus is not a good fit
for this. Generally speaking, Prometheus is intended to run in the same network as its targets; if you cannot do that, you probably have to look elsewhere.

But there are a lot of advantages, too. First, high availability is super easy: you just run multiple Prometheus servers and point them at the same exporters. Done. That also means you can have production data in your test environment. For example, we had an intern and we wanted him to work on our metric system. We never attached him to our production Prometheus, but he had a Prometheus on his notebook and got access to the metrics endpoints of the systems that were relevant to his work, and he could do everything he needed. That's a very nice property.

Then, outage detection is really easy: if scrapes fail, you know something's fishy. Reasoning about how long you haven't heard from a system — so that it's probably dead — is possible too, but more complicated. What I personally like is the predictable effect on the infrastructure, because more traffic does not mean more metric traffic — it's always the same. You set once how often you want your data to be scraped, and that's it. Which also means that it does not congest an already busy network if something is going on in your system.

And finally, it means that instrumenting third parties is pretty easy, actually, because any production-ready system has some kind of instruments that it exposes to its users: any database has a special table with performance metrics, web servers have their status pages, Java has its JMX. We just have to take these metrics and transform them into something that Prometheus understands — and it turns out that what Prometheus understands is pretty easy for you to understand, too.
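As a preview of what such a scrape returns, the text an exporter serves might look roughly like this — the metric name, the request count of 390, and the total of roughly 177 seconds follow the example discussed next, while the bucket boundaries and per-bucket counts here are made up for illustration:

```text
# HELP app_req_seconds Request latency.
# TYPE app_req_seconds histogram
app_req_seconds_count 390.0
app_req_seconds_sum 177.3
app_req_seconds_bucket{le="0.1"} 205.0
app_req_seconds_bucket{le="0.5"} 355.0
app_req_seconds_bucket{le="2.0"} 388.0
app_req_seconds_bucket{le="+Inf"} 390.0
```

Note the buckets are cumulative: every sample that fits under 0.1 is also counted under 0.5, 2.0, and +Inf.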
So let's look at what it looks like. This is what an exporter exports — there's always at least the option of a human-readable format. In this case, it is the first part of a histogram about request latencies; again, a very bad, short metric name, for a big font.

The first time series is the number of measurements that have been observed — so: how many requests did we observe? The second one is the sum of the measured times, the total time observed. In this case, we had 390 requests that all together took 177-point-something seconds. This is super cheap to keep track of — we're just adding float numbers — and these are also literally the samples that Prometheus stores. If you're using the summary type in Python, this is all you get.

To get percentiles, as I've said, you also need buckets, and they look like this. In this case, we have six buckets. The `le` label gives you the upper limit that a sample has to fit under, and it trickles down: something that fits into 0.5 also fits into 2.0. The value is the number of samples that fit into the bucket. Prometheus can interpolate percentiles from this, and that's good enough in practice. You can always increase the precision of your percentiles by adding more buckets, but you have to make sure that your values distribute evenly over your buckets — or distribute at all — because if all values land in one bucket, Prometheus cannot compute anything meaningful out of it. So please define your buckets based on the latencies you have, not the latencies you would like to have, because that's not useful.

So: we have metrics in a database. What do we do with them?
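The server-side interpolation just mentioned can be sketched in a few lines. This is a simplified, standard-library-only approximation of the idea behind PromQL's `histogram_quantile()` (the real function also handles the `+Inf` bucket and a few edge cases differently); the bucket data in the test is made up.

```python
def bucket_quantile(q, buckets):
    """Estimate the q-quantile (0..1) from cumulative (upper_bound, count)
    pairs by linear interpolation inside the bucket the quantile falls in --
    roughly what histogram_quantile() does server-side."""
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the last cumulative count is the total
    rank = q * total                # how many samples lie below the quantile
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if rank <= count:
            # Interpolate linearly between the bucket's lower and upper bound.
            span = count - prev_count
            frac = (rank - prev_count) / span if span else 1.0
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```

This also shows why bucket layout matters: if all samples land in one bucket, the interpolation has nothing to work with and the estimate degenerates to that bucket's bounds.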
We query them, and for that we use the Prometheus query language, called PromQL. I don't have enough time to give you a proper intro — and there's really amazing stuff going on, you can implement the Game of Life in it — but I'll give you a few examples.

You will usually have a lot of related time series that you want to aggregate into one, or into a few. For example, say you have many back ends in multiple data centers, and you want a total request rate over all back ends. We'll work ourselves from the inside out. Here's the counter again, which you saw on a slide before. To compute the rate, the function needs a so-called range vector: this returns a vector — an array — of the values of the past one minute. How many that are depends on the granularity of the data, so on how often you scrape your targets within that one minute. The rate function then computes how fast this counter is rising. At that point, you have the request rate for every single back end in every single data center, and now you just sum them up — and you have one value: the total request rate over everything.

Now, what if you want to know the rate of the back ends in one data center? Then you just add a filter, which looks like this — and here you can see how nicely it works if you have structured labels, instead of having to work with dot-separated names. The rest is all the same. And if you want the request rate broken down by data center, you drop the filter again and tell the sum function to retain the DC label. In this case, you get as many rates as you have DCs. Simple.

Now, what else is interesting?
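To collect the queries just described in one place, they might look roughly like this — the metric name `app_reqs_total` and the label names `dc`/`"ams1"` are assumptions for illustration:

```promql
# Per-backend request rate over the past minute:
rate(app_reqs_total[1m])

# Total request rate across all back ends and data centers:
sum(rate(app_reqs_total[1m]))

# The same, but filtered to a single data center:
sum(rate(app_reqs_total{dc="ams1"}[1m]))

# One rate per data center: drop the filter, retain the dc label:
sum(rate(app_reqs_total[1m])) by (dc)
```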
Percentiles, of course. Prometheus uses so-called φ-quantiles, which — completely oversimplified — are percentiles divided by 100. This is the 90th percentile: we take the rate of the buckets we just saw before, and histogram_quantile will do the rest. Of course, this gives us as many quantiles as we have label combinations, so you may want to aggregate — but other than that, we have the percentiles you always wanted.

So, I hope you have somewhat of a taste of how powerful PromQL is. It is used by all its consumers, most notably the visualizations. There's the internal one, which is not pretty, but it's nice for playing around and drilling in — something's going on, let's quickly look at what it could be — and then you use the query elsewhere. It's a bit limited because it has only one expression per graph, so you cannot do any correlations. You can build dashboards with Go templates, if that's your thing — but it's not mine. PromDash still has the best integration, because it used to be the official visualization thingy, but it's deprecated now, because Grafana has merged official Prometheus support. So: it's deprecated, don't bother, go for the real thing.

Grafana — I think a good number of people are in this room just to find out how to use Grafana, or what to do with it, because it's the best and best-looking dashboard software right now.
It has many, many integrations. You can build dashboards from different sources, so you can introduce Prometheus and still keep your InfluxDB or Graphite and integrate them into one dashboard, which is really nice. It also gives you a step-by-step introduction, so yeah: use this, there's no reason to use anything else.

The final piece of the puzzle is alerting. You can use PromQL to formulate alert conditions, and Prometheus will then push firing alerts into a separate daemon called the Alertmanager. So, again, example time. Let's talk about monitoring for full disks — because once a disk is full, it's too late, but alerting on some random threshold can lead to noise, which leads to alarm fatigue. So let's use a crystal ball to be notified in time, without noise: we want an alert that fires when a disk is going to be full in four hours. And this is our crystal ball — it's more high-school mathematics, and it's called linear regression. In this case: if, given the samples of the past one hour, the disk will have less than zero capacity in four hours, and that condition is true for five minutes — so a small spike doesn't just fire off some alerts — then we want to be alerted.

How do we want to be alerted? Again, it's completely pluggable. It integrates with a lot of notification back ends: of course email, PagerDuty, web hooks. So yes, you can have Slack.

So how do you get this web scale? Which is — I promise — really the final part.
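Before the final part, the disk-full crystal ball described above might be written like this in today's rule-file syntax (the talk predates this YAML format, and the metric name `node_filesystem_free_bytes` is the current node exporter's name — both are assumptions here):

```yaml
groups:
  - name: disk
    rules:
      - alert: DiskFullIn4Hours
        # Linear regression over the samples of the past hour,
        # extrapolated four hours into the future:
        expr: predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
        # Must hold for five minutes so a short spike doesn't page anyone:
        for: 5m
```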
The answer is federation: Prometheus servers can get their data from other Prometheus servers. The typical use cases are aggregation — which can mean that you have one Prometheus server per data center, or one per team, or one per service type, and then you aggregate all the data from these Prometheus servers into one big one — or down-sampling. Say you have one really, really fast server with SSDs which is scraping all your targets, and you have high-resolution data, which you want for monitoring; but you also want to keep some history of how your servers behaved over the years. In this case, you would just sample it down to a lower resolution for long-term storage, with a second server that has slow but big disks. And that's all there is to it.

You should have a general idea of how Prometheus works now, so let's look at how to get data into it. There's a lot we can do without touching your code, so let's start without breaking things. Prometheus has been public for over a year and has a very active ecosystem — the 1.0, by the way, has been released, I think, this week. I've already pointed out that it's easy to write exporters for third-party things, and that's the reason why there are so many already. That includes bridges, which is really cool, because it means you can point your existing instrumentation at these exporters, and they will transform whatever you are doing right now into the Prometheus format — and Prometheus can give you the nice alerting and graphing and whatnot.

Native is better, though. So let's start with platforms. For fully featured servers, there's the official node exporter, which will instrument your server from the inside — bare metal, KVM, LXC. Now you know what picture comes next: one-process containers, like Docker, are instrumented from the outside using container APIs, and that tool is called cAdvisor.
It's not Prometheus-specific, and I believe it's from Google. So, depending on how you run your system, decide — and installing such a daemon gives you full system insight: you get statistics about CPU, memory, network, I/O, and much, much more. This is super useful if you want to put your own metrics into context. Installation of these should be an automatic part of provisioning new servers — not something you have to remember or only do when you think about it.

Another non-intrusive method is mtail. mtail will follow any log file, and it will compute metrics on the fly, based on regular expressions. That's very powerful, and in some cases — like the Apache web server — you even get better metrics if you set a custom log format and use certain regular expressions to extract them. It's better than the status page it's serving, so you should definitely check it out. No matter whether status pages or log files: you should always instrument the outer edge of your infrastructure, which usually is some web server — or, better, something like an HAProxy load balancer.

If you look from the outside, there are also black-box exporters — think Pingdom, just for free. They will probe your system using HTTP, TCP, or even ICMP, a.k.a. ping. But they add additional load, which nothing of what we talked about before really does. Then again, databases: every database has some way to get data out of it — use it. And if you run your own infrastructure, there's also an SNMP exporter.

So, at this point, we already have detailed information about our platform. We know how to look at our app from the outside, by analyzing logs or even probing it, and we know that we can instrument third-party dependencies. Assuming you instrument your web server, you can already correlate request times with platform metrics — like the server load — and dependency metrics, like: what the hell is going on in your Postgres? This is good, but we need to dig deeper, and for that — oh, sorry.
I forgot clicking, I'm so excited. We have to touch your code. There it is.

To make things interesting, we'll use an example — and since it's a computer conference, the example involves cats. Let's assume you've built a groundbreaking product: software that determines whether a photo contains a cat. Now you need to deploy it as an HTTP service where the user posts a picture, and you reply with a "meow" or a "no", depending on what the picture contains. How hard can it be? Let's build a Flask web service — and you don't really need to know Flask to understand this. You just check authentication — which, because your colleagues read Hacker News, is a microservice written in Go, deployed on Docker — and you have an expensive computation that does the actual business logic: is this a cat?

Now, I bet lots of you have already written APIs like this. It's really fast, it's really cheap. Let's instrument it, and for that we use the official Prometheus client package. Even before we change code, we do the least we can do: we just start the HTTP endpoint, which then runs in a separate thread. Why?
Because on Linux, you get process statistics for free, immediately, and that includes your memory usage, the timestamp of when your process started, your CPU time, the number of open files, and the maximum number of open files. So without really changing a line of code, you can already detect memory and file descriptor leaks — which happen, and are really painful when your server just stops accepting connections and you don't know why — and you can monitor whether you approach the fd system limit. Nice.

But let's start instrumenting. For that, we define some metrics: first, a histogram that measures our request latency; then a histogram that tracks how long the actual analysis takes; and finally a gauge that will tell us how many requests are active right now. Now we add them to the app: we just add these two decorators that do exactly what they sound like. One tracks how many function calls are in progress — which is how many views are in progress — and the other one measures the time that is spent in this function.

Now, you might be saying that middleware would be much better, because you could have labels with the view name and status code — and you'd be completely right. Please do that; I do that. But Werkzeug middleware is a bit out of scope here. Additionally, we measure the time to analyze, because for all we know, all the time sinks into authentication — which, in turn, is not instrumented, at least ostensibly. And that's because I've decided to make it a shared package, and you should instrument the package itself: if you use some package ten times, why should you instrument it ten times?
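A toy version of those two decorators, using only the standard library, might look like this. In real code you would not write these yourself — the official `prometheus_client` ships equivalent decorators on its `Gauge` and `Histogram` objects — but the sketch shows what "do exactly what they sound like" means mechanically:

```python
import functools
import time


class InProgress:
    """Toy gauge: counts how many decorated calls are running right now."""
    def __init__(self):
        self.value = 0

    def __call__(self, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            self.value += 1
            try:
                return fn(*args, **kwargs)
            finally:
                self.value -= 1
        return wrapper


class Timed:
    """Toy histogram: records how long each decorated call took."""
    def __init__(self):
        self.count = 0
        self.sum = 0.0

    def __call__(self, fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                self.count += 1
                self.sum += time.monotonic() - start
        return wrapper


REQ_IN_PROGRESS = InProgress()
REQ_TIME = Timed()


@REQ_IN_PROGRESS
@REQ_TIME
def detect_cat(picture):
    time.sleep(0.01)   # stand-in for the expensive analysis
    return "meow"
```

The view function itself stays one decorated line away from being uninstrumented — which is the "one extra line per view" trade-off discussed below.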
So, again, we define a metric for the time spent, and one for errors — especially because, as I've said, it's a microservice, which makes it a distributed system, which makes it fail in the most inconvenient ways at the most inconvenient times. You have to look out for that. So whenever we fail, we increment the error counter and try again — and yes, I'm aware this is not how you retry in a distributed system. If the error rate goes up, you have a problem — a big problem. But we also count the invalid login attempts, because they are a red flag, too: either you may be under attack, or you have some subtle failure in your authentication server which manifests itself as wrong credentials but actually just means that someone changed the data format or something. These metrics have the same name in every app that uses them, and you differentiate them using the `job` label.

If done properly — which means you instrument your shared libraries, and you put the related metrics into middleware or even into your WSGI container, because both Gunicorn and especially uWSGI offer a lot of possibilities to hook into them — you're left with one extra line per view, which is tolerable. And I really think we should not feel ashamed about instrumentation. I'm kind of allergic to having a lot of instrumentation that repeats itself in your code and pollutes everything, and you should still really try to pull things out into decorators and middlewares. But in the end, any serious production software has instrumentation — anything that you connect to with your web apps, or whatever you are writing. So do it, too. Nobody ever regretted having too much information when things go sideways.

Now you may be asking: what about async? Well, you may not, but I do, and that's why I've written prometheus_async, which supports asyncio and Twisted and does the right thing with Deferreds and coroutines. And because I'm bad at math,
I did not re-implement the metric logic; instead, I simply wrap the metrics from the official client — and that's all there is. This allows you to use the official client in asyncio applications.

It also comes with a few goodies, so let me call them out. It has an aiohttp-based metrics exporter that is much more flexible and configurable than the one that comes with the official client, and you can start it in a separate thread, which means it's useful with any Python 3 application out there — you do not have to use it with asyncio applications. I personally use it with my Pyramid apps; I just need the configurability. It also includes auto-registration with a Consul agent — because we use Consul — but the service discovery is kept completely generic, so whatever you use, you just have to write two functions to integrate it with your favorite one. It basically means you just say, in your own code, "start the metrics endpoint and register it", and as soon as your metrics are up, Consul will know about them. And Consul is very well integrated with prometheus_async, so it's very little overhead for you to get this working once you've put the pieces in place.

So, time is running out, but everything is instrumented, so let's wrap up really fast. What did I promise? I promised prediction: if you have good dashboards, and if you use predict_linear — or, even better, holt_winters, which allows you to apply a smoothing factor that will favor older or newer values, depending on how you set it — you're just fine. Alerting: there's the Alertmanager, there's a very powerful way to interact with it, and it integrates with almost everything. And then there's the holistic overview: if you instrument widely, you will have the data for everything — you can build dashboards, you can play with PromQL.
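The "wrap, don't re-implement" idea for coroutines can be sketched with the standard library alone. This is not prometheus_async's actual API — just an illustration of the kind of async-aware wrapping it layers on top of the synchronous client; `observations` here is a plain list standing in for a real histogram's `observe()`:

```python
import asyncio
import time


def time_coroutine(metric, fn):
    """Wrap a coroutine function so that `metric` records its wall-clock
    duration -- even across awaits -- once the coroutine finishes."""
    async def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return await fn(*args, **kwargs)
        finally:
            metric.append(time.monotonic() - start)
    return wrapper


observations = []   # stand-in for a real histogram's observe()


async def handler():
    await asyncio.sleep(0.01)   # stand-in for real async work
    return "ok"


handler = time_coroutine(observations, handler)
```

The point is that a naive synchronous decorator would stop timing at the first `await`; the wrapper has to be a coroutine itself so the measurement spans the whole call.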
You have everything you need if the feces hit the fan — and this is not theoretical. Last week, we had a really big operational emergency in our company, which was not our fault: we ran into a very obscure bug that only happens on obscure platforms — FreeBSD. So while the operational staff was busy trying to contain the fire, I, from the developer side, built a dashboard for them, so we could immediately see: we try this — what happens? Oh, it's still rising, let's try something different. This is very useful if you don't have to just keep pressing uptime or staring at top.

I believe I've covered everything. So I hope you're eager to measure all the things. Please study the talk page — as always, it contains all the links and all the projects. Follow me on Twitter, get your domains from Variomedia, and I'm not taking questions, because I'm really bad at understanding questions on stage — but if you have any, I'm out there. I'm here until Sunday; just come and chat me up. Thank you.