Yeah, okay. So I'm Fabian, I work at CoreOS where I do Prometheus stuff and some related things, and I want to talk about alerting with time series today. In general this is all high-level stuff, it's not tied to Prometheus, but of course that's where we implemented these things.

So first, let's look at what a time series is. Essentially it's just a stream of timestamp–value pairs associated with some sort of identifier. In the Prometheus case it looks like this: we have a metric called http_requests_total, and then there are some dimensions associated with it, so you can have multiple series for a single metric, which gives you insight into certain dimensions. And then you just have a bunch of samples; here we have three values, and each has a unique timestamp. In total it can look like this: we have a metric, and then we have, for example, the request path and the status code, and each unique combination of labels and their values is a new time series.

And then you can do cool stuff with them: if you have a query language, you can do aggregations and other calculations. Here we have a rate function, which takes a window of five minutes over all the series with the metric name http_requests_total and the label job="nginx". That gives us a request rate, and then we can aggregate these again by certain labels — in this case the request path and the status code. What you get is pairs of path and status code and the rate value for each. And we can evaluate the same expression over a range of time, and then we get, essentially, new series.

Quickly, an overview of how Prometheus collects time series data: Prometheus talks to your service discovery system. You can hook up anything you want, basically — we have native integrations for Kubernetes, AWS, Consul, et cetera, but you can also plug in your custom ones. By that, Prometheus always knows the state of the world: your service discovery system is the source of truth and Prometheus just stays in sync. So it knows where the services or applications we monitor are running, and it can go out and scrape them for metrics, which are exposed in an open format, and then it stores all this time series data. Then you have an API where you can hook up a dashboard or any custom application, and you can also evaluate alerting rules, as we will see later, on the collected data.

The benefit of having time series data as the foundation for alerting is that it solves a lot of problems you would otherwise have to solve in your alerting system once again. For example, you have a lot of traffic to monitor. Let's say you have some sort of web service and now you're getting scraped or DDoSed or however you want to call it — some external actor is increasing the traffic. This should not cause your monitoring traffic to increase. In an event-based system, that is the case: people are DDoSing you, and now you are DDoSing yourself again by generating more monitoring traffic. This doesn't happen in the Prometheus case, because we collect events already time-aggregated on the client. And the same goes for alerting: if you alert based on time series data, it isn't affected at all by any external traffic. And then we have a lot of stuff to monitor, at least potentially, right?
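As a rough sketch of the kind of query just described — the metric and label names mirror the example above, but the exact spelling is an assumption for illustration:

```
# Per-second request rate over a five-minute window, aggregated to one
# series per combination of request path and status code.
sum by (path, status) (
  rate(http_requests_total{job="nginx"}[5m])
)
```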
If you have a large infrastructure with microservices and you're running hundreds of thousands of service instances, you just have a lot of stuff you have to track. The service discovery system is helpful here because it always tells you what is where; your monitoring system just goes out and scrapes it, you end up with one pool of time series data, and the alerting system doesn't even have to know where it came from. This also helps with the constant change that's happening in infrastructure: if you are scaling deployments up or down, or doing rolling releases, new instances appear, and if your time series system already covers this, your alerting system doesn't even have to know about the change.

Then, in general, you want a fleet-wide overview. That's where the data model of time series with dimensions helps you, because you have a query language that allows you to calculate aggregations and get a high-level overview over very complex and rich raw data. But because we store the raw data, we are still able to drill down — we never lose any detail, because all the aggregations happen at query time. So all these benefits of time series directly translate into alerting if you build alerting on top of this time series data.

One more point: ideally you also want to monitor all levels with the same system — your switches, your routers, your nodes, the applications running on them — everything with one system, because it makes it easier to correlate data about different things, and you don't have to teach all your engineers and teams different systems. And this, of course, also directly translates into your alerting.

So how can we make sense of all this time series data? It's potentially billions and billions of samples every day, about all aspects of the infrastructure, and you somehow have to turn that into something that gives you meaningful alerting. The go-to answer is often to do some anomaly detection, right? You have big data, so do big data stuff with it. The problem is that if you're actually monitoring at scale and collecting really, really rich data, something is always correlating. Some smart machine learning algorithm might be able to detect problems across 1,000 different time series, but if you have 20, 30, 40, 50 million time series, there's always some noise that will trigger some signal. And then you get paged a lot, and that's just not healthy: if 99% of your pages are not actually meaningful, people stop paying attention and stop reacting to pages. You can't have that. So you tweak your methods to get fewer false positives, and at some point that means you will have false negatives — and that's something you cannot afford either, because a single false negative potentially means your entire infrastructure is down and nobody is being notified.

So in Prometheus, or in general with time series based alerting, we try to solve this differently. We try to be very explicit about what the desired condition is and in which condition we are not happy anymore. And it's quite simple, actually: you have a current state, which is captured by time series, and you should be able to very explicitly define your desired state — what should things be like? The delta between those is what you want to alert on. That's basically it: time series based alerting is about finding this delta.
And essentially — that's something Brian talked about this morning already — you want to have symptom-based pages. Anything that actually wakes someone up should be an urgent issue affecting your users, and it should be actionable. This means you have a system you want to define alerting for, and you look at the boundary between the user and the system itself. The user can be an end user, like a customer, but it can also be an internal service that's calling your service or database, et cetera. And of course, any system often has dependencies as well, but those actually don't matter, because the only thing we care about is whether our user is having a good experience. If a dependency is affected, that might or might not mean that my user is affected.

There are four basic signals you can use to define alerting on. One of them is latency. If you're providing a service, there's always some sort of time bound: you want to provide the service in a reasonable amount of time. Users don't want to wait ten minutes until they see a website; you probably want it to be somewhere on the order of ten milliseconds to, I don't know, two seconds. That's something really simple: you can define an alert on latency, and if it's violated, your user is having a bad experience and you can page someone. There's also traffic: if you have a service, you expect it to have traffic, and if the traffic drops to zero, something is probably wrong. Closely related, there are errors: if you provide a service, you don't want all requests to error, or too many requests to error. There's always some percentage — you probably have a tolerance threshold of, say, 0.5% of your requests being allowed to error, and above that you want to alert.

And then there are cause-based warnings, and they are different. You can alert on a lot of other stuff as well: on system internals, for example garbage collection being too slow, or on dependencies of your service being down, unreachable, or slow. But all of these don't necessarily matter. If you designed a resilient system, it might actually be built to deal with dependencies being slow, the user is not affected, and the operator of the system does not have to be paged. In general, everything that's not at the boundary between the system and the user is helpful for investigating an actual issue, but you don't want to page on it. There's one exception, and that's the fourth golden signal: saturation, or capacity. For example, disk space running out, file descriptors running out, memory running out — all that kind of stuff. That's stuff where you can actually predict where it's going, and if your disk is going to be full in one hour, you're going to have a really big problem in one hour. So that's probably something you want to detect early, before there's an actual symptom, and fix before then.

So now, how can we do this time-series-based alerting? We'll just take Prometheus as the example, because it's what we do and it's easy to understand. You define alerts on the machine collecting the data: the Prometheus instance collecting the data is also the one evaluating whether the data indicates an alert. And we have a simple DSL for that: you specify an alert with the alert keyword and provide a name, and then you provide any Prometheus query language expression.
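To make that concrete, here is a minimal sketch in the classic Prometheus rule syntax, using the etcd example that comes up in a moment; the for clause and the labels shown here are explained just below, and the exact severity label name is an assumption:

```
ALERT EtcdNoLeader
  # Fires for every etcd instance that reports not having a leader.
  IF etcd_has_leader == 0
  FOR 1m
  LABELS { severity = "page" }
```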
And that's the same query language you use for graphing, so you can directly take a graphing query, put it into an alerting rule, and it will be a fully working alert. Then you can define a for clause, which is essentially a time period that avoids flapping alerts: you say, if this triggers for five minutes, then actually send something out — because it might just be a blip which heals itself, and then you don't have to page someone. And lastly, annotations are just context information you can attach to the alerts being sent out; we will see this later, but it's not so important for now.

What you get out of an alerting query is the same thing you would get from a regular Prometheus query: you essentially get back a vector of elements, where each element is a label set with one value associated with it. For example, we have a metric called etcd_has_leader, which is one if an etcd instance has a leader and zero if not, and we split this out by the instance name — A, B, and C in this case — and by the job, which just identifies the service. Here, instances A and B have no leader and C can see a leader, and we want to alert on instances that don't have a leader. That's super straightforward: you just say etcd_has_leader == 0, and if that condition is true for one minute, you send out an alert and attach a label — severity set to page — which is later used downstream in your routing system to decide how to notify about a certain alert. And the alert, in the end, is also just a label set, as you can see here.

Now, two more complex examples. We have a metric, requests total, which is just a counter incremented for every request the application receives, and it's split out by the request path, by the instance, obviously, and by the request method. In the same way, we have a counter for errors on these requests, split out by the same labels. So in the end, request errors divided by requests total will give you the percentage of errors you are currently seeing. We want to define an alert, and first of all we have to take a rate, because this is just a counter and we want the per-second value over a certain time window — in this case five minutes. Now we have a request-error-per-second rate for every single series, and we want to sum this up, because if you have 200 service instances, this would otherwise yield up to 200 alerts. So we sum it up and say: if this is greater than 500 — more than 500 errors per second — we want to alert, as sketched below.

The problem here is that this is an absolute threshold, so this alert will need constant tuning whenever your scale changes, which can mean different things. Over a day your traffic changes — you have a curve of your users being active or not — so over time the threshold, the dashed line here, changes its semantic meaning with respect to the total traffic you have. That's not really good, because at the spikes throughout the day it will trigger even though nothing really changed. In other cases it's traffic over months: you have a growing user base, people are just using the service more, and now you always have to go and check your total traffic and adjust this alert again and again.
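For reference, the absolute-threshold rule just discussed might look roughly like this; the metric name and job label are assumptions for illustration:

```
ALERT ManyErrors
  # Absolute threshold: more than 500 errors per second across the whole
  # service. Needs retuning whenever the overall traffic level changes.
  IF sum(rate(request_errors_total{job="my-service"}[5m])) > 500
  FOR 5m
  LABELS { severity = "page" }
```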
And similarly, just spikes: for example, if you get popular on Twitter or something and a lot of people come to your site, that also causes spikes, and the error rate might go above the threshold even though it stayed the same relative to the total traffic. To avoid this, we can obviously take the errors relative to the total traffic. So we now divide what we had before by the total request rate, and we can say: if that fraction is greater than 1%, we want to be alerted. We have now defined, for our entire service, across all instances, a global threshold of 1% of requests that are allowed to error before we page someone. That's better, because now the dashed line is adaptive to our traffic.

That's pretty nice, but the problem now is that we are losing dimensionality. We are losing detail, and signals might cancel out. The simplest case: let's say you have a contact form, which is rarely used compared to your index page. Your contact form can error every single time a user requests it, and it will be totally lost in the noise of the index page always responding successfully. So the sum can be this blue line here — way, way below the threshold — while a certain path or request method is actually erroring far too often. You want to preserve this detail, and for that we can keep certain label dimensions of our metrics in the aggregations we do. So now we don't sum everything away, and we can see for every single path whether it's above the threshold or not.

That's still wrong — intentionally — because we have to take care of which dimensions we actually use here. In a microservice architecture you are scaling horizontally for fault tolerance and scalability, and a single instance failing is something you are totally willing to deal with; I mean, that's why you did this whole thing. So you might have one instance out of a thousand that's acting up and erroring a lot, but that's totally fine — it's not going to impact the service you provide too much — so you probably don't want to include this dimension in your alerting, because your system is supposed to handle this failure, not some operations engineer. So we can invert the aggregation condition and say not which labels we want to preserve, but which labels we want to aggregate away. That's more powerful, because it's easy to drop a dimension by accident if you sum by something, but this way you have to be very, very explicit about which dimensions you want to drop: you just put into your aggregation the dimensions you are fault tolerant along — in this case the service instance — and it will preserve everything else.

What we now get is pairs of method and path, and for each of these we apply the threshold. That's what we wanted in the first place: for our entire service we now have one rule that gives us full error detection on any bad condition — one alert covering an entire service, or potentially even multiple services if they all expose the same metrics. Yeah, that was the example for error rates. Latencies work similarly: you can calculate latencies for all your service instances, split them out by certain dimensions, and apply dynamic thresholds, as sketched below.
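A sketch of the resulting error-ratio rule, plus a simplified latency analogue with a static quantile threshold; the metric names, the histogram, and the thresholds are assumptions for illustration:

```
ALERT HighErrorRatio
  # Error ratio per (path, method); the instance dimension is explicitly
  # aggregated away because single failing instances are tolerated.
  IF  sum without (instance) (rate(request_errors_total{job="my-service"}[5m]))
    / sum without (instance) (rate(requests_total{job="my-service"}[5m]))
    > 0.01
  FOR 5m
  LABELS { severity = "page" }

ALERT HighRequestLatency
  # Same aggregation pattern: 99th percentile latency per (path, method).
  IF histogram_quantile(0.99,
       sum without (instance) (rate(request_duration_seconds_bucket{job="my-service"}[5m]))
     ) > 2
  FOR 10m
  LABELS { severity = "page" }
```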
Another, different thing is cause-based warnings: running out of disk space, running out of file descriptors, running out of memory. You want to detect these early. What systems usually do is probe your hard drive, check how much space is used — 80 percent, for example — and if you're above a certain threshold, they tell you that this is a bad condition. There are several problems here. You usually set this to something like 80 percent because if you set it to 90 and the disk is filling up really fast, you have no time to react; that's why you set it well below what you could actually be using of your disk. Another case: say your disk is 10 percent full but it's filling up really, really fast. Now your alert won't fire until you hit 80 percent, and then you have no time to react, because it will take another three minutes until your drive is actually full. So looking at just the current state doesn't really get you anywhere, at least not reliably.

What you actually want is to look at a range of time and see how this resource is developing: is the disk rather full but staying basically constant, or is it filling up really fast? And we can do this. We can look at the time series data we collected over time, look at how much disk space we have now and how much we had one hour ago, and based on that development make a prediction of where it will be in four hours. If that tells you it will be completely full, you can alert someone early, without any absolute thresholds. What Prometheus does in this case — it's called predict_linear — is a linear regression over the one-hour interval we specify, and from that it estimates where we will end up. And this is actually one alert which covers every single hard drive in your entire infrastructure, and you can apply similar alerting to file descriptors, memory usage, et cetera.

You can also see a nice use of annotations here. We want to provide the receiver of the alert with some detail: just sending "disk full within four hours" is not really helpful if you don't know which drive is actually affected. Of course, this information is preserved in the label dimensions, and we can use the label dimensions that are on every alert to define annotations, which give you information about what's going on. Here, for example, is a description: for every alert that's generated, we generate one description, templated using the labels, and the alert you get specifies very clearly which device is affected, on which mount point, and on which machine.
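Going by the description above, a rough sketch of such a prediction rule, assuming a node_exporter-style filesystem metric; the metric name, job label, and exact time windows are assumptions:

```
ALERT DiskWillFillSoon
  # Linear regression over the last hour of free-space samples; alert if
  # the prediction for four hours from now (14400 seconds) drops below zero.
  IF predict_linear(node_filesystem_free_bytes{job="node"}[1h], 14400) < 0
  FOR 5m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    description = "Device {{ $labels.device }} mounted at {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill up within four hours"
  }
```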
So much for the Prometheus side. These rules are put on your Prometheus servers — or any other time-series-based system — and are constantly evaluated, and as soon as any of them yields a result, there is an alert and we want to push it out somewhere. In the general case you could just push alerts directly to the user, by sending an email, contacting PagerDuty, or anything else, but that's not really meaningful or helpful. You want redundancy — multiple machines evaluating the same alerts — and you don't want to get duplicates. You also want to aggregate alerts: even if your alerts are not super noisy, there can still be quite a lot of them about a single service, and ideally you just want to get a single page telling you that something is wrong, and then spend the rest of your time reacting instead of acknowledging pages. And you also want to do some sort of advanced routing, and you don't want all of this to live directly in your monitoring system.

That's why there is the Alertmanager. It sits outside of Prometheus, it's an HA component which runs on multiple nodes at once that communicate with each other, and you send all the alerts to the Alertmanager, which then takes care of deduplication, aggregation, and routing, and dispatches notifications about one or more alerts to different integrations — different chat systems, PagerDuty, email, et cetera.

To visualize this a bit: let's say we have alerting rules sitting on some Prometheus machines — it doesn't really matter which one — and they're generating alerts. A single alerting rule can potentially generate thousands of alerts, every second in theory. What you get is a stream of stuff: high latency for service X in zone eu-west for this path, for this method; the next one is about a different path; the next one is a duplicate of the first — just a stream falling out. And that's not really what you want to receive; I don't want to get all these pages every few seconds. So the Alertmanager sits in between, receives all this stuff, and tries to group it. If you have a certain service and several alerts defined on that service, the Alertmanager knows how to take all the alerts that belong to the same service and put them into one notification. What you then get is something like an email or a page telling you: you have 15 alerts for this service in this data center, three of them are about latency, ten of them are about errors, and two of them are about some caches being slow. You get the detail of all these alerts, but you get one page, about one problem that belongs together, and you won't spend all your time acknowledging and acknowledging.

It also knows how to do inhibition, which is a somewhat advanced feature. Let's say you have an alert that tells you that your data center is burning down. That will probably affect quite a few things in this data center, and they will probably all generate alerts and pages — but now you are waking up 20 to 30 people for a problem they can't really deal with. There's one person who knows how to deal with the data center being on fire, and nobody else really needs to do anything at this point. That's just one extreme example of inhibition, but it's also a feature of the Alertmanager. And of course it can handle silencing for you: these alerts are sets of labels, and you can silence along certain labels for certain periods of time. So if you're taking a service into maintenance, you can just silence all the alerts that have the label service equals X.
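To sketch how the grouping and inhibition behaviour described above is expressed on the Alertmanager side, here is a minimal configuration example; the receiver name, label values, and the PagerDuty key are placeholders, not anything from the talk:

```yaml
route:
  receiver: team-pager
  group_by: ['service', 'zone', 'alertname']   # alerts sharing these labels become one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: team-pager
    pagerduty_configs:
      - service_key: '<pagerduty-key>'   # placeholder

inhibit_rules:
  # While a "data center on fire" style alert fires, mute other pages
  # from the same data center.
  - source_match:
      alertname: DatacenterOnFire
    target_match:
      severity: page
    equal: ['datacenter']
```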
So, another topic: anomaly detection. People still want this, in a way, for certain things. It's really not necessary — if you just cover these four golden signals you have really reliable alerting — but you can do it. As I said before, there's no magic thing that detects all possible problems from all your data, but you can do something practical: you can specify the problem domain in which you want to detect anomalies and the metrics involved, and then define some sort of imprecise, soft estimation of things that might be wrong but which don't necessarily indicate a serious problem. These can be helpful and interesting, but they should probably never be paging.

As one example, we have requests, which are the blue line here, and they're kind of spiky, but we now want to look at requests over the course of days being out of the normal — for example, you want to alert if the requests you're currently getting are 20 percent lower than they were last week at the same time. And Prometheus can do that. We can first calculate the request rate on our requests total metric for the entire service, and then apply holt_winters, which is an exponential smoothing function. That gives us this red curve here: out of all the spikes we get a smooth curve giving us a reasonably accurate representation of the traffic development over the course of days. Now we can use the series we just generated to build an alert that notifies us when traffic is out of the 20 percent range of last week's traffic: we take our current request rate, we take the smoothed curve we calculated, from seven days ago, and if the current request rate is outside the 20 percent range around it, we want to alert (a rough sketch follows below, after the next example). So this is, depending on how you define it, a form of anomaly detection.

Another example, stolen from Brian: this is just to show that even if you specify very clearly what condition you want to alert on and what you consider abnormal, it's really, really hard to get right — and it stresses how hard this would be to automate completely. Let's say we have the latency in seconds for each service instance over the last five minutes, and we want to alert if any particular instance has a latency outside the two-standard-deviation range of the average of all the instances — so one instance has a high latency with respect to all the other ones. That's an expression you could write in Prometheus; it's pretty advanced, powerful, et cetera. But then you realize that most of your instances have a latency that's almost exactly the same — they are very tightly clustered — so even a small deviation of one single instance would trigger this, because the standard deviation is really low. So you add another condition to catch this edge case, in the same expression, and now you also require the latency of the abnormal instance to be 20 percent higher than all the other ones in total; we combine the standard-deviation condition with a relative one. And then you realize that service instances that are not requested often are sort of cold — they have cold caches, they take very long — so you only want to apply this to instances that are actually receiving more than one request per second. Now we have a relatively complex expression, basically just to catch some edge cases that you encountered. And imagining this being completely automated by some sort of machine learning system — you can kind of guess how well that is going to go.
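Going back to the week-over-week traffic example above, a rough sketch of how it might be written, using assumed recording-rule and metric names and the classic rule syntax:

```
# Recording rules: the service-level request rate, and a smoothed version of it.
job:requests:rate5m          = sum by (job) (rate(requests_total[5m]))
job:requests:rate5m_smoothed = holt_winters(job:requests:rate5m[1h], 0.3, 0.1)

ALERT RequestRateAbnormal
  # Warn (don't page) if the current rate is more than 20% away from the
  # smoothed rate at the same time one week ago.
  IF   job:requests:rate5m < 0.8 * (job:requests:rate5m_smoothed offset 7d)
    or job:requests:rate5m > 1.2 * (job:requests:rate5m_smoothed offset 7d)
  FOR 30m
  LABELS { severity = "warning" }
```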
Self-healing is also possible, if you want. In theory, Prometheus scrapes something, gets metrics, evaluates an alert, and sends a notification about this alert, and the Alertmanager can hook up and notify any system you want. So you can write your own webhook that the Alertmanager sends alerts to, which then evaluates these alerts, expands the label sets, checks what the alert is about, and can then take action. So in theory, let's say you have NTP drift and you just want an automatic system that restarts a node whenever this happens: you can define an alert on that, the Alertmanager notifies a small server that just reboots nodes, and that one takes action based on the alerts that came in. So much for self-healing — that can also be built, in theory, if you want to.

In conclusion: you want symptom-based pages. Everything you wake someone up about should be at the boundary between the service and the user of the service, and it should be actionable. Anything cause-based, from the internals of your service or its dependencies, is helpful during the investigation, but it should just be a warning flashing somewhere that does not interrupt someone's work or sleep. Your alerts should always be adaptive to change; they should preserve as many dimensions as they can, but be sure to aggregate away the dimensions you are fault tolerant along and that you want to ignore. For anything related to capacity planning or saturation detection, you want to use linear prediction. And if you want to, you can do anomaly detection — the alerting expressions are very powerful and you can go as far as you want — but you probably don't need it, and you certainly shouldn't page on it. And the raw alerts generated by these alerting rules should not be consumed directly by a human; there should always be some sort of intermediate layer, like the Alertmanager, that does aggregation and meaningful routing of alerts. That's it. Are there any questions? There's one over there.

Two quick questions from the audience. First, around the Alertmanager: can you submit alerts into it from other systems than Prometheus, or is it better to go through Prometheus? It has an API, so in theory you can. There are some semantic considerations to take into account — it's documented — you can't just send anything, you have to follow some rules, but it's possible. And the second question is around inhibition, the alert inhibition: can you do that at runtime — can you send a message to the Alertmanager saying there's a short-term inhibition until another state happens? Inhibitions are not time-based, they are purely label-based, and they are configured — but you can theoretically just change the configuration file and reload it. It sounds like what you're looking for is a silence, which temporarily suppresses alerts.

The next question is down here: if we want to know, say, how many nodes in a cluster are offline — for example, notify a human when 20% of the nodes in the cluster are offline — should that be done in the description of the alert, or should the Alertmanager aggregate and decide whether to notify a human? So the question is where to do the aggregation: in the alerting rule versus in the Alertmanager. That always depends a bit. Ideally you aggregate the least amount you can in the alerting rule — just the dimensions of fault tolerance, like the instance in a microservice — so do the least amount of aggregation in the alerting rule, but as much as necessary. Okay, let's take the next questions from this section.
Some of the rules, or the queries you're using to define alerts, looked pretty sophisticated. Do you have a way of testing those without actually having to deploy them to real systems, so that you know they work? So the question is whether we have unit tests for rules. Not yet — but that's something you want. That's basically it.

Okay, I missed a bit of the talk, so I hope you didn't cover it already, but sometimes with exporters for Prometheus, when the application fails but the exporter is still running, they just don't export new metrics. What would be a good way to have an alert that triggers when no new values appear for a metric? So the question is: if we have an exporter exporting metrics for an application, and the application is down but the exporter is still up and exporting metrics, how do we alert on the application being down? In the general case, the exporter should have a metric indicating whether it can reach the application itself, and then you alert on that single time series.

Any other questions? On this side, please — and we'll run the microphone around again. Any more questions? Going once, going twice. Okay, thank you, Fabian. The next talk is in seven minutes, at 25 past, and there are still stickers here from Regius — if you want to talk about the Regius conference, you can grab a stack.